PREFACE


This book contains the abstracts of the works presented at the 7th Doctoral
Workshop in Computer Science and Mathematics (DCSM 2022), held at
Universitat Rovira i Virgili (URV), Campus Sescelades, Tarragona,
on March 31, 2022. The aim of this workshop is to promote the dissemination
of ideas, methods, and results developed by the students of the PhD program
in Computer Science and Mathematics at URV. It was jointly organized
by the research group of Intelligent Robotics and Computer Vision (IRCV) and
the Doctoral Program on Computer Science and Mathematics of Security
of URV.


The editors and organizers invite you to contact the authors for more detailed
explanations, and we encourage you to send them suggestions and comments
that may help them in the next steps of their PhD theses.
We thank all the participants and, especially, the students who presented their
work at this DCSM workshop. Finally, we also want to thank Universitat
Rovira i Virgili, the Departament d’Enginyeria Informatica i Matematiques
(DEIM), and the Escola Técnica Superior d’Enginyeria (ETSE) for their support.


Mohamed Abdel-Nasser, Oriol Farras, Doménec Puig, and Hatem A. Rashwan 


DeepKey: Watermarking Deep Learning Models 
Abstract. Many organizations devote significant resources to building high-accuracy deep learning
(DL) models. Thus, they have a great interest in protecting their trained models from being stolen 
or misused. Embedding watermarks (WMs) in DL models is a useful means to protect their owners’ 
intellectual property (IP). This paper proposes DeepKey, a novel watermarking framework for DL 
models. We leverage multi-task learning (MTL) to learn the original classification and watermarking 
tasks jointly. Empirical results show that DeepKey can preserve the utility of the original task and 
embed a robust WM. 


Keywords: Deep learning models; Ownership; Intellectual property; Watermarking. 


1 Introduction 


Owners of deep learning models, such as technology companies, devote significant resources
to training their models on vast amounts of proprietary training data, whose collection also
implies a significant effort [1]. Thus, they seek compensation for the incurred costs by
reaping profits from commercial exploitation [3]. Due to the competitive nature of the
technology market, a stolen or misused model is clearly detrimental to its owner in both
economic and competitive terms. Therefore, legitimate owners need a robust and reliable
way to prove their ownership of DL models in order to protect their intellectual property
(IP). Embedding watermarks (WMs) in DL models is a useful means to this end [2].

We propose DeepKey, a novel watermarking framework that allows owners to embed
reliable and robust digital WMs in their DL models. Extensive experiments show that
DeepKey can successfully embed robust WMs with reliable detection accuracy while preserving
the accuracy of the original task. The remainder of the paper is organized as follows.
Section 2 presents an overview of our framework. Section 3 reports the experimental results.
Finally, Section 4 gathers conclusions and proposes several lines of future research.


2 The DeepKey framework 


The key idea of our framework is to perform two tasks at the same time: the original
classification task $T_{org}$ and the watermarking task $T_{wm}$. Fig. 1 shows the global workflow
of DeepKey.

Watermark embedding. DeepKey takes four main inputs in the WM embedding
phase: the target model (pre-trained or from scratch), the original data set, the owner’s
WM carrier set, and the owner’s information string. The outputs are the marked model,
the corresponding private model, and the owner’s signature. First, the WM carrier set samples
are signed using the owner’s signature. After that, the signed WM carrier set is combined
with the original data set, and they are used to fine-tune (or train) the target model.
Finally, the private model takes the final predictions of the original model as inputs and
outputs the position of the owner’s signature on the WM sample. We leverage MTL to
train the two models jointly.

Fig. 1: DeepKey global workflow.
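
As an illustration only, joint training of this kind is often written as a weighted sum of the two task losses. The abstract does not state the exact objective, so the symbols below (the losses $\mathcal{L}_{org}$ and $\mathcal{L}_{wm}$, the weight $\lambda$, and the parameter sets $\theta$ and $\phi$ of the target and private models) are assumptions introduced for this sketch:

\[
\min_{\theta, \phi} \; \mathcal{L}_{org}(\theta) + \lambda \, \mathcal{L}_{wm}(\theta, \phi), \qquad \lambda > 0 .
\]

Under this reading, the first term keeps the accuracy on the original task, while the second term trains the target and private models together to recover the signature position.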

Watermark extraction and verification. To extract and verify the ownership of a 
remote black-box DL model, the owner first delivers the WM carrier set and her signature 
to the authority. She also tells the authority about the methodology used to sign the WM 
samples and the predefined positions where the WM may be placed. Next, the authority 
(i.e., the verifier) randomly chooses a sample from the carrier set, puts the signature in a 
random position, queries the suspicious remote DL model and sends the model’s predictions 
to the owner. The owner (i.e., the prover) takes the predictions, passes them to her private
model, and tells the authority the position of her signature on the image. The authority 
repeats the proof as many times as she desires. After that, the owner’s answer accuracy 
is evaluated according to a minimum threshold. If the owner surpasses the threshold, her 
ownership is regarded as proven by the authority. 
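
The challenge-response protocol described above can be sketched as follows. This is a minimal simulation under stated assumptions, not the authors' implementation: the function names, the toy stand-in models, and the choice of 100 rounds are hypothetical, and the acceptance rule is taken to be "accuracy at or above the threshold".

```python
import random

POSITIONS = 6  # assumed number of predefined signature positions


def verify_ownership(query_model, private_model, carrier_set, sign,
                     rounds=100, threshold=0.9, seed=0):
    """Simulate the verifier/prover exchange; return (accuracy, accepted)."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(rounds):
        sample = rng.choice(carrier_set)         # verifier picks a carrier sample
        position = rng.randrange(POSITIONS)      # and a random signature position
        predictions = query_model(sign(sample, position))  # black-box query
        answer = private_model(predictions)      # prover decodes the position
        correct += int(answer == position)
    accuracy = correct / rounds
    return accuracy, accuracy >= threshold


# Hypothetical toy stand-ins, only to make the sketch runnable.
def toy_sign(sample, position):
    return (sample, position)


def toy_marked_model(signed):
    return [signed[1]]                 # predictions leak the position


def toy_unmarked_model(signed):
    return [signed[0] % POSITIONS]     # predictions ignore the position


def toy_private_model(predictions):
    return predictions[0]


carrier = list(range(50))
acc_marked, ok_marked = verify_ownership(
    toy_marked_model, toy_private_model, carrier, toy_sign)
acc_unmarked, ok_unmarked = verify_ownership(
    toy_unmarked_model, toy_private_model, carrier, toy_sign)
```

In this toy setting the marked model lets the prover answer every challenge, while the unmarked model leaves her guessing, so only the first claim is accepted.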


3 Experimental results 


Original and watermark tasks data sets and models. We used the CIFAR10 data 
set for the original task while we used STL10 as a WM carrier set. We used ResNet18 
and VGG16 DL models for the original task while we used a simple DL model (with 496 
learnable parameters) as a private model. Fig. 2 shows some examples of signed carrier set 
images and their corresponding labels. 

We used accuracy as a performance metric for the original task and the WM task. We 
set the required threshold T = 90% to prove model ownership. In the following, we evaluate 
the fidelity, reliability and integrity of DeepKey. Also, we assess its robustness against two 
types of attacks: fine-tuning [4] and model compression [5]. 

Fidelity and reliability. Embedding the WM should not decrease the accuracy of 
the marked model on the original task. As shown in Tab. 1, DeepKey did not degrade the 
accuracy of the original task and successfully embedded the watermark. This is thanks to 
the joint training, which simultaneously minimizes the loss for the original task and the 
WM task. Also, it shows that legitimate owners were able to reliably prove their ownership 
with accuracy greater than 90%. 
Fig. 2: Examples of signed STL10 carrier set images employed with the CIFAR10 data set.

Integrity. DeepKey yields low WM detection accuracy with unmarked models, and
thus it does not falsely claim ownership of models owned by a third party. In our
experiments, there were 6 classes for the watermarking task. Looking at Tab. 2, the accuracy of
falsely claimed ownership of unmarked models is not far from guessing 1 out of 6 numbers
randomly, which equals approximately 16.7%.
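
To see why chance-level accuracy cannot reach the 90% threshold, one can compute the probability that a prover who guesses uniformly among 6 positions passes the test. The sketch below assumes independent rounds and a pass rule of "at least 90% of the rounds correct"; the number of rounds is a free parameter chosen by the verifier and is not fixed in the text.

```python
from math import comb


def chance_pass_probability(rounds, threshold_pct=90, n_classes=6):
    """Probability that uniform guessing reaches the threshold (binomial tail)."""
    p = 1.0 / n_classes
    k_min = -(-threshold_pct * rounds // 100)   # ceiling of threshold * rounds
    return sum(comb(rounds, k) * p**k * (1.0 - p)**(rounds - k)
               for k in range(k_min, rounds + 1))


p10 = chance_pass_probability(10)     # at least 9 of 10 guesses correct
p100 = chance_pass_probability(100)   # at least 90 of 100 guesses correct
```

Even with only 10 rounds the probability is below one in a million, and it shrinks rapidly as the verifier repeats the proof.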


Table 1: Accuracy of the original and the WM tasks

                  | Unmarked model | Marked model accuracy %        | WM detection accuracy %
Benchmark         | accuracy %     | By finetuning | From scratch   | By finetuning | From scratch
                  |                | (30 epochs)   | (250 epochs)   |               |
CIFAR10-ResNet18  | 91.96          | 92.07         | 92.53          | 99.97         | 99.96
CIFAR10-VGG16     | 90.59          | 90.52         | 91.74          | 99.89         | 99.68


Table 2: Integrity results with unmarked models. Each private model was tested with two
different unmarked models: one model has the same topology as its corresponding marked
model, the other one has a different topology. The last four columns show the detection
accuracy obtained with the unmarked models.

        |          | WM detection accuracy | WM detection accuracy with unmarked models %
Dataset | DL model | with marked models %  | Same topology | Accuracy | Different topology | Accuracy
CIFAR10 | ResNet18 | 99.97                 | ResNet18      | 18.92    | VGG16              | 19.80
CIFAR10 | VGG16    | 99.89                 | VGG16         | 7.92     | ResNet18           | 12.32


Robustness. Tab. 3 shows that DeepKey was robust to fine-tuning attacks with a
number of fine-tuning epochs ranging from 50 to 200. Fig. 3 shows that DeepKey is robust
against model compression, and the accuracy of the WM remains above the threshold
T = 90% as long as the marked model is still useful for the original task.


4 Conclusion 


We have presented DeepKey, a novel watermarking framework that enables DL model 
owners to embed robust watermarks in their models while preserving the accuracy of the 
main task. As future work, we plan to extend DeepKey to watermark federated deep neural 
networks. 


Table 3: Robustness to model fine-tuning

Benchmark                | CIFAR10-ResNet18      | CIFAR10-VGG16
# of epochs              | 50    | 100   | 200   | 50    | 100   | 200
Marked model accuracy %  | 92.40 | 92.30 | 92.47 | 91.31 | 91.64 | 91.69
WM detection accuracy %  | 98.19 | 98.05 | 99.12 | 97.20 | 94.72 | 96.67

Fig. 3: Robustness against model compression
Abstract. Deep learning (DL) models have been used to solve various critical tasks
in the past few years. However, those models are opaque in terms of how their
predictions are made. For end users to trust those models, they should
be able to generate local explanations of the predictions made by the DL
models. In this work, we present a novel approach allowing an end user to locally
generate explanations for a DL classification model accessed via a provider’s API.
We approximate the provider’s model with a local surrogate model. We then use the
surrogate model’s components to locally generate explanations.


Keywords: End-user explanations; Deep learning; Counterfactual explanations. 


1 Introduction 


Building highly accurate deep learning (DL) classification models requires a
large amount of training data, whose collection and labeling involve a
significant effort. Therefore, small businesses and ordinary users, who cannot afford
this effort, resort to big technology companies that provide paid API access to
highly accurate DL models via Machine Learning as a Service (MLaaS)
platforms [4]. These end users then query those models with their (small) data
and obtain the final classification predictions.

Even though end users are interested in using MLaaS with highly accurate
DL models, they may not entirely trust such models due to the lack of
transparency of DL predictions. Obtaining explanations alongside predictions helps
end users understand why a DL model produces a specific prediction, which
increases the trust in the model and contributes to clearer decision-making [1].

We propose a novel approach that allows an end user to locally generate
DL model-specific explanations for a DL classification model accessed via a
provider’s API. The approach consists of two main phases: i) approximating
the provider’s model by a local surrogate model, using the small portion
of data owned by the end user, and ii) using the surrogate model to locally
generate DL model-specific explanations that approximate the explanations
obtainable with white-box access to the provider’s model. The remainder of
the paper is organized as follows. Section 2 presents an overview of our
proposed method. Section 3 reports the experimental results. Finally, Section 4
gathers conclusions and proposes several lines of future research.


2 Explaining deep learning classification model predictions on 
the user’s side 


The importance of our proposed method lies in allowing users to reliably 
understand how the providers’ models make their predictions and determine 
whether these predictions are trustworthy. 

In the first phase, we need to approximate the provider’s model by a local
surrogate model having an accuracy as close to that of the provider’s model
as possible. However, the end user does not own enough data to train such a
model. To tackle this challenge, we employ a modified version of the Mixup
method [6] to augment the user’s unlabeled data and obtain more representative
training data. Once the user obtains the augmented data, she queries the
provider’s API to label the data. Afterward, since model knowledge can be
transferred from a complex “master” model to a simpler “student” model [3],
the end user trains a local surrogate model using the newly labeled data.
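
The augmentation step can be illustrated with the standard Mixup rule applied to unlabeled inputs. This is a sketch under assumptions: it shows plain Mixup [6], not the authors' modified version (whose details are not given here), and the function name, the `alpha` value, and the toy data are hypothetical. Labels are not mixed, because in the described pipeline they are obtained afterward by querying the provider's API.

```python
import numpy as np


def mixup_unlabeled(x, n_new, alpha=0.2, seed=0):
    """Create n_new convex combinations of randomly paired samples from x."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(x), size=n_new)      # first sample of each pair
    j = rng.integers(0, len(x), size=n_new)      # second sample of each pair
    lam = rng.beta(alpha, alpha, size=n_new)     # mixing coefficients in [0, 1]
    lam = lam.reshape(-1, *([1] * (x.ndim - 1))) # broadcast over feature axes
    return lam * x[i] + (1.0 - lam) * x[j]


# Toy usage: 10 unlabeled 28x28 "images" with values in [0, 1).
user_data = np.random.default_rng(1).random((10, 28, 28))
augmented = mixup_unlabeled(user_data, n_new=50)
```

Each augmented sample lies on the segment between two of the user's samples, so the augmented set stays within the value range of the original data.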

Once the end user obtains the trained local surrogate model, she can use it
to generate accurate explanations for the predictions of the provider’s model.
Since the surrogate model has learned almost the same decision boundaries
as the provider’s model, explanations generated using the surrogate’s
internal components can be expected to accurately approximate the explanations
generated using the provider’s internal components.


2.1 Generating the explanations. 


In our work, we explain the provider’s model by generating counterfactual
explanations [5] of a specific example. Counterfactual explanations tell us how
to change the example’s features so that its predicted label also changes. In
this way, we can understand how the model makes its predictions and explain
individual predictions. We use adversarial training [2] as a means to generate
adversarial examples that counterfactually explain the model predictions. In
fact, adversarial examples are aimed at fooling the model rather than
explaining it, but, in the end, they serve the same purpose as counterfactual examples
by slightly changing the features of input examples to modify their predicted
labels.
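
The idea of nudging an input until its predicted label flips can be sketched on a toy linear classifier. This is an illustration only, not the authors' procedure: the linear model, the signed-gradient step rule, and all names and values are assumptions chosen to keep the example self-contained.

```python
def sign(v):
    return (v > 0) - (v < 0)


def predict(w, b, x):
    """Binary label of a linear classifier: 1 if w.x + b > 0, else 0."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)


def counterfactual(w, b, x, step=0.05, max_steps=1000):
    """Take small signed-gradient steps until the predicted label changes."""
    original = predict(w, b, x)
    direction = -1.0 if original == 1 else 1.0   # push the logit to the other side
    x_cf = list(x)
    for _ in range(max_steps):
        if predict(w, b, x_cf) != original:
            return x_cf
        # For a linear model, the gradient of the logit w.r.t. x_i is w_i.
        x_cf = [xi + direction * step * sign(wi) for xi, wi in zip(x_cf, w)]
    return None                                  # no label change within budget


w, b = [1.0, -2.0], 0.0
x = [1.0, 0.1]                                   # logit 0.8, predicted label 1
x_cf = counterfactual(w, b, x)
changes = [abs(a - c) for a, c in zip(x, x_cf)]
```

The per-feature differences in `changes` play the role of the explanation: they show which features had to move, and by how much, for the label to change.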
3 Experimental results


Models and data sets. We use the gender classification and MNIST data
sets to test the performance of the proposed method on image data. In all the
experiments, we took the surrogate model to be simpler than the provider’s
model, which saves training time while retaining most of the
provider’s model knowledge.

We used the following evaluation metrics to measure the performance of 
the trained surrogate models and the generated explanations: 


• Accuracy: We used this metric to measure and compare the performance
  of the provider’s and the surrogate models.

• Structural Similarity Index Measure (SSIM): We used SSIM to measure
  the similarity between the explanations provided by the surrogate model
  and the ones generated by the provider’s model for image data.
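
For reference, the SSIM formula can be sketched in a simplified, single-window form. This is an assumption-laden illustration: standard SSIM averages the same expression over local windows, the constants 0.01 and 0.03 are the customary defaults, and the text does not say which implementation the experiments used.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM computed over the whole image (simplified variant)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2                # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2                # stabilizes the contrast term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2))


image = np.random.default_rng(0).random((28, 28))
same = global_ssim(image, image)                 # identical images
inverted = global_ssim(image, 1.0 - image)       # contrast-inverted image
```

Identical images give a score of 1, and the score drops as the structure of the two images diverges.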


Accuracy of surrogate model. Table 1 reports the accuracy of the 
provider’s models and the trained surrogate models. We can see that the 
performance of the surrogate models was nearly equivalent to that of the 
provider’s model. 


Table 1. Accuracy of the provider’s model and surrogate model.

Data set | Provider model | Surrogate model
Gender   | 96.3%          | 94.47%
MNIST    | 99.22%         | 96.1%


Surrogate model explanation. Table 2 reports the average SSIM for the 
Gender and MNIST validation images. We can see that the surrogate model 
generates explanations (counterfactual examples) with very high similarity to 
those generated by the provider’s model, which indicates that the surrogate 
model is properly approximating the provider’s model. 


Table 2. Similarity between the adversarial examples generated by the surrogate
model and those generated by the provider’s model on the Gender and MNIST data
sets.

Data set | Average SSIM
Gender   | 96.79%
MNIST    | 98.27%


Figure 1 shows two examples of these visual explanations generated for
the Gender and MNIST data sets. By looking at the pixels that caused the
prediction to change, we can see that, in general, the explanations generated by
the surrogates were consistent with those generated by the provider’s model.

Fig. 1. Visual explanations generated by the surrogate model in comparison with
those generated by the provider’s models. Panels: gender classification data set
(male -> female, female -> male) and MNIST data set; columns: predicted -> changed
label, original image, provider model, surrogate model.


4 Conclusion 


We have presented a novel approach that enables the end user to locally 
generate explanations on the predictions of the provider’s model. As future 
work, we plan to test the performance of our approach on other computer 
vision tasks, such as detection and segmentation, as well as natural language 
processing. 


Acknowledgement. We acknowledge support from the European Commission (projects
H2020-871042 “SoBigData++” and H2020-101006879 “MobiDataLab”).


References 


[1] Blanco-Justicia, A., Domingo-Ferrer, J., Martinez, S., Sanchez, D. Machine
learning explainability via microaggregation and shallow decision trees.
Knowledge-Based Systems, 2020.

[2] Bruna, J., Szegedy, C., Sutskever, I., Goodfellow, I., Zaremba, W., Fergus, R.,
Erhan, D. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199,
2013.

[3] Hinton, G., Vinyals, O., Dean, J. Distilling the knowledge in a neural network.
arXiv preprint arXiv:1503.02531, 2015.

[4] Ribeiro, M., Grolinger, K., Capretz, M. A. MLaaS: Machine learning as a service.
IEEE 14th International Conference on Machine Learning and Applications (ICMLA),
pp. 896-902. IEEE, 2015.

[5] Wachter, S., Mittelstadt, B., Russell, C. Counterfactual explanations without
opening the black box: Automated decisions and the GDPR. Harvard Journal
of Law and Technology 31 (2017) 841.

[6] Zhang, H., Cisse, M., Dauphin, Y. N., Lopez-Paz, D. mixup: Beyond empirical
risk minimization. International Conference on Learning Representations, 2018.


Distance and Size Calculation of the detected 
objects on Floor from robot using Bounding Box 


Aditya Singh * 


Department of Computer Engineering and Mathematics, Universitat Rovira i Virgili 
Tarragona, Spain 
aditya.singh@urv.cat 


1 Introduction 


This work aims to develop a mathematical relation between the position of an
object in the two-dimensional image plane and in three-dimensional world space
with respect to a robot-mounted camera, using object detection bounding
box coordinates. In human-centric robot navigation, it is necessary
to perceive the distance and position of objects with respect to the robot.
The method takes the object detection information, in the form of a bounding box
obtained with monocular vision, together with the robot kinematic parameters to
establish a mathematical relation between the object and the robot camera. The
position of the object is calculated in two parts: the distance in the forward
direction and the lateral position.


2 Methodology 


The process is tested on a Locobot robot (PyRobot [1] platform, developed
by Trossen Robotics). The robot uses an Intel RealSense camera for vision.
The process uses the YOLOv3 algorithm for object detection. It uses the Manhattan
World Assumption [2] to define the floor as a horizontal plane, so that in the 3D
world all the pixels from the floor lie in a single plane.


2.1 Object Detection and ground object discovery 


A YOLOv3 model takes an image $I(x, y)$ as input and predicts the objects
present in the image. The output of the object detection is $(H_i, W_i, C_i)$, which
are the dimensions of the bounding box for the $i$-th object. For identifying the
ground-located objects, the bounding box dimensions play a crucial role. As shown in
figure 1 and figure 2, the pixel row (pixel height $H_l$) that corresponds to
the camera height in world space is taken as the reference for the calculation. The
peculiarity of this pixel height is its 2D nature, i.e., the height of the world
points corresponding to it will not change in the image plane of the camera
under 2D motion. Objects whose bounding box lower side is below this
line are considered floor-located objects. This assumption works well in
most cases due to the low height of the camera.
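
The floor-object rule can be sketched as a simple comparison between the lower edge of the bounding box and the reference row. This is a minimal illustration under assumptions: the image origin is taken to be the top-left corner with rows increasing downward, the bounding box is given by its vertical center and height in pixels, and the function names and numbers are hypothetical.

```python
def lower_edge(center_y, box_height):
    """Row of the lower side of a bounding box (origin at the top-left)."""
    return center_y + box_height / 2.0


def is_floor_object(center_y, box_height, reference_row):
    """True if the lower side of the box lies below the reference row."""
    # With rows increasing downward, "below" means a larger row index.
    return lower_edge(center_y, box_height) > reference_row


reference_row = 240                               # assumed camera-height row
chair = is_floor_object(300, 100, reference_row)  # lower edge at row 350
lamp = is_floor_object(100, 50, reference_row)    # lower edge at row 125
```

Here the first box reaches below the reference row and is treated as standing on the floor, while the second one is not.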


Fig. 1. Side view for relation between camera view and environment.

Fig. 2. Image plane information for object detection output.


2.2 Distance Measurement 


After floor object identification, the bounding box coordinates are used to
calculate the object’s position on the ground. The height of the lower side of the
bounding box is $C_i + H_i/2$ and is denoted by $H_t$ (all distances are measured with
respect to the origin of the image plane (0,0), as shown in figure 2). The horizontal
field of view (FoV) of the camera is $\theta_h$, the vertical FoV is $\theta_v$, the
height of the camera is $h$, the angle made by the lower side of the object bounding
box and the lower side of $\theta_v$ on the camera lens is $\theta_b$, and the angle
made by $H_l$ and the lower side of $\theta_v$ is $\theta_l$.
By the right-triangle relation, the distance between the camera and the object is
given by

$Z = h / \tan(\theta_b - \theta_l)$    (1)

and $\theta_b$ is calculated as

$\theta_b = \theta_v \times H / (H - H_t)$    (2)


2.3 Object Size Measurement 


The distance ($Z$) of the object from the robot becomes the reference for
calculating the height and width of the object and its placement to the left or
right. The projection of the object is considered perpendicular to the floor. This
calculation uses the focal distance $f$ of the robot camera: with $f$ and $Z$,
every point of the image plane can be mapped into the real
world by using

$(X, Y) = Z \times ((x_0, y_0) - (x_i, y_i)) / f$    (3)

where $(x_0, y_0)$ are the coordinates of the centre of the image plane and $(X, Y)$
is the deviation of the point from the center line of sight of the robot camera.
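
The image-to-world mapping of the last formula can be sketched directly in code. This is an illustration with assumed values: the function name, the focal distance expressed in pixels, the image center, and the sample points are all hypothetical, and the distance Z is taken as already known.

```python
def image_to_world_offset(Z, center, point, f):
    """Deviation (X, Y) of an image point from the camera's line of sight."""
    x0, y0 = center
    xi, yi = point
    return (Z * (x0 - xi) / f, Z * (y0 - yi) / f)


Z = 2.0                    # assumed distance to the object, in meters
center = (320, 240)        # assumed image center, in pixels
f = 500.0                  # assumed focal distance, in pixels

left_edge = image_to_world_offset(Z, center, (220, 240), f)
right_edge = image_to_world_offset(Z, center, (270, 240), f)
width = abs(left_edge[0] - right_edge[0])   # object width from two box corners
```

Mapping the two lower corners of a bounding box in this way gives the object width as the difference of their X offsets, and the sign of X tells on which side of the line of sight the object lies.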


3 Progress 


This idea has been applied on the Locobot robot for semantic mapping of indoor
environments. The results are good, and its most interesting feature is that it
is lightweight: it can run on any kind of robot processor to calculate the distance
to objects. In figure 3, distance prediction results are shown for images taken
in a laboratory environment.


Fig. 3. Results for distance measurement of detected objects 
1 Introduction


The concept of quality of life (QOL) is difficult to operationalize. However,
recent developments in this area indicate that improved quality of life is a
realistic and attainable goal for everyone, including people with intellectual
disabilities (ID). This work aims to analyze a dataset recorded during
interviews with individuals, including older people and people with intellectual
disabilities. The interviewer asked questions related to the dimensions of QOL. Many
research works [1], [2] propose eight dimensions of QOL: Emotional Well-being (BE),
Interpersonal Relations (IR), Material Well-being (BM), Personal Development (DP),
Physical Well-being (BF), Self-determination (AU), Social Inclusion (IS), and
Rights (DR). Each dimension has four to six objective questions. Based on an
individual’s answers in each dimension, a professional interviewer assigns an index
value of QOL. The index value reflects the outcome of the
corresponding eight dimensions of the quality of life. We interviewed a
total of twenty-six individuals and built a dataset. We use a multiple linear
regression model to analyze the dataset.


2 Methodology 


This work is motivated by the goal of building a learning machine that can evaluate
the quality of life of an individual and, based on that evaluation, suggest possible
support in a particular dimension. Figure 1 shows the complete layout
of the work. It starts with the individual’s interview, in which questions
related to the eight dimensions of QOL are asked. The answers are recorded on a
3-point Likert scale, and the converted answers are stored for each individual in
each of the eight dimensions. Furthermore, an expert gives the corresponding support
index values based on the input values. In step two, we record the data for each
individual and prepare it for a learning algorithm by pre-processing it.
After training, the model predicts the expected support index value for new
incoming QOL dimension values. Based on the support index value, we detect
the deficiencies in any dimension of QOL, and professionals make an action
plan to improve one or more aspects of QOL. This process is repeated
periodically and aims to enhance the quality of life of individuals, including
persons with intellectual disabilities.


Fig. 1. Complete architecture to evaluate Quality of Life of an Individual


2.1 Dimension of Quality of Life and Support Intensity Scale 


Developments in this area have shifted the focus from striving to define
QOL to identifying its basic dimensions [1]. They show that QOL is a
multidimensional phenomenon rather than an individual trait. People’s QOL is affected
by the interaction between personal and environmental factors. Therefore, its
evaluation is based on subjective and objective measures. Recent research [3]
describes the development of a new paradigm that integrates QOL with support
(QOLSP).


Dimension of Quality of Life | Area of the Support Intensity Scale
Emotional Well-being         | Health and healthcare, protection and defense, and behavioral support need
Interpersonal Relations      | Social activities
Material Well-being          | Employment activities
Personal Development         | Home life activities, lifelong learning
Physical Well-being          | Health and healthcare, exceptional medical need
Self-determination           | Protection and defense
Social Inclusion             | Community life activities, social activities
Rights                       | Protection and defense, health and healthcare

Table 1. Dimension of quality of life and area of the support intensity scale.


The multidimensional concept of QOL proposes various factors that determine
the dimensions of QOL. These factors are independence, social participation,
and well-being [2]. Based on these factors, the dimensions are shown
in Table 1. The support areas corresponding to these dimensions are shown in
the second column of Table 1. These dimensions encompass every part of a
person’s personality. Various indicators show which areas need work to
improve each quality of life dimension.


2.2 Multiple Linear Regression Model 


MLR (Multiple Linear Regression) is a popular regression algorithm for
solving scenarios with multiple input attributes. QOLSP contains eight input
dimensions, each with its own support index, and MLR predicts the
corresponding support index value based on these eight input dimensions of an
individual’s QOL. This algorithm is implemented using the equation below.


$Y_i = W_0 + W_1 X_{i1} + W_2 X_{i2} + W_3 X_{i3} + \dots + W_p X_{ip} + \epsilon$    (1)


where $Y_i$ represents the support index, $X_{i1}, \dots, X_{ip}$ represent the QOL
dimensions, $W_0$ represents the bias, and $W_1, \dots, W_p$ represent the slope
coefficients for each QOL dimension. The model error is denoted by $\epsilon$. We
divided the dataset into 80 percent for training and 20 percent for testing.
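
A least-squares fit of this model with an 80/20 split can be sketched as follows. The data here are synthetic and generated only to make the example runnable; they are not the interview dataset, and the function names, the random seed, and the noise level are assumptions.

```python
import numpy as np


def fit_mlr(X, y):
    """Least-squares estimate of [W_0, W_1, ..., W_p]."""
    A = np.column_stack([np.ones(len(X)), X])   # column of ones for the bias W_0
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w


def predict_mlr(w, X):
    return w[0] + X @ w[1:]


rng = np.random.default_rng(0)
n, p = 26, 8                                    # 26 individuals, 8 QOL dimensions
X = rng.integers(1, 4, size=(n, p)).astype(float)   # synthetic 3-point scores
w_true = rng.normal(size=p + 1)                 # synthetic ground-truth weights
y = w_true[0] + X @ w_true[1:] + rng.normal(scale=0.1, size=n)

order = rng.permutation(n)
n_train = int(0.8 * n)                          # 20 training, 6 test samples
train, test = order[:n_train], order[n_train:]

w = fit_mlr(X[train], y[train])
y_pred = predict_mlr(w, X[test])
```

With only 20 training samples and 9 coefficients, the fit is sensitive to noise, which is one reason to report errors on the held-out samples separately.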


3 Progress 


This study uses a machine-learning system to predict the corresponding
support index value. Furthermore, with the assistance of dimension
specialists, we must create an action plan that corresponds to the support value
and provide it to the individual with a matching action sheet. After training the
model with the training dataset, we evaluated it on the training and test cases in
terms of mean absolute error, root mean square error, and $R^2$ score,
as shown in Table 2.


Evaluation metric       | Train case | Test case
Mean absolute error     | 0.490262   | 1.350473
Root mean square error  | 0.635070   | 1.472938
R^2 score               | 0.998192   | 0.991830


Table 2. Evaluation metrics for quality of life evaluation
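
The three metrics in Table 2 follow their usual definitions, sketched below on a small hand-made example. The numbers are illustrative only and unrelated to the reported results.

```python
from math import sqrt


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]       # one prediction is off by 2

mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_square_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```

The single large error affects RMSE more than MAE, which is why the two are usually reported together.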
1 Introduction


We want to address the problem of Diabetic Retinopathy (DR) classification. 
As a consequence of diabetes, the blood vessels of the eye may break and 
generate small blood spots, hemorrhages and exudates. These lesions produce 
vision loss and may even cause blindness if they are not detected and treated 
at an early stage. 

In this PhD thesis, the goal is to improve a DR classification model based
on a Fuzzy Random Forest (FRF) used in the Retiprogram system [2] [3]. It
uses the clinical data of the patients to assess their risk of developing DR.
This model is being tested by a group of ophthalmologists at Hospital Sant
Joan de Reus. The general results are good (with a sensitivity and a specificity
over 75%), but there are still many misclassifications. Errors are mainly due
to the inherent ambiguity of the training examples (very similar patients can
belong to different classes) and to the strong imbalance between the two classes
(more than 90% of diabetic patients do not develop DR).

We are constructing a methodology to take advantage of the data of the
new patients who are treated at the hospital. We propose to modify the set
of trees that compose the FRF, which will allow updating the model without
retraining the base model from the ground up.


2 Proposed method 


The proposed architecture is illustrated in Fig. 1. Each time a sufficiently 
large set of new cases is collected, the updating method is applied to improve 
the classification model. 

As a first dynamic component, we consider the ensemble voting procedure 
of the fuzzy random forest. Two methods are usually applied: majority and 
weighted voting. They have both been studied. 
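
The difference between the two voting schemes can be sketched with crisp per-tree votes. This is a simplification under assumptions: a fuzzy decision tree may output membership degrees rather than a single label, and the way the per-tree weights are derived from sensitivity and specificity is not detailed here, so the weights below are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(votes):
    """Label predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]


def weighted_vote(votes, weights):
    """Label with the largest total weight over the trees that chose it."""
    totals = defaultdict(float)
    for label, weight in zip(votes, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


votes = [0, 0, 1]                   # two trees say "no DR", one says "DR"
weights = [0.2, 0.2, 0.9]           # hypothetical per-tree reliabilities

by_majority = majority_vote(votes)
by_weight = weighted_vote(votes, weights)
```

In this toy case, a single reliable tree outweighs two weak ones under weighted voting, while majority voting follows the count.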
As a second dynamic component, we have the new data collected for
updating the rules in the random forest. To improve results, the use of previously
misclassified examples (i.e., errors) is proposed. Three different ways of
dealing with errors are studied, named no errors, errors, and all errors, depending
on which error examples are used during the dynamic updating.


Fig. 1. Architecture of the iterative learning of Fuzzy Random Forests


The overall updating architecture is composed of three steps. The first one 
is not iterative, and the other two steps are run in iterations each time the 
model has to be updated. The three steps are briefly explained next. 


1. Base model training: The first stage consists on training the base model 
with a large training dataset, 7, obtaining n fuzzy decision trees, where n 
is a large number, usually more than 100. During the construction process, 
the out-of-bag samples of each fuzzy decision tree, are used to compute 
for each of them their specificity and sensitivity. Those metrics are used in 
the weighted voting, and in the update process. The training dataset can 
also be used for testing, and the samples that are not correctly classified 
are stored in Ey. Those errors samples Fr are used in the following stage. 
2. New model training: Every time enough new samples D_i have been
gathered, around 200 samples, a new training iteration i is performed.
The merge process generates the dataset used to update the fuzzy random
forest, D'_i, which depends on the method version. The errors version
merges the error data from previous iterations with D_i. For the first
iteration, i = 1, the E_T error samples are merged. For further iterations,
the E_{D_i} samples generated in the third step of the method are merged.
The all errors version also merges the training data D_i from previous
iterations. Finally, the D'_i samples are used to train a new fuzzy random
forest with a smaller number of trees, m, around 20. Their out-of-bag
samples are also used to compute the aforementioned metrics for each of the
new trees.

3. Dynamic update: The m fuzzy decision trees trained in the previous
step are used to update the current model. They are added to it and, to
improve its performance, the worst fuzzy decision trees are removed. The
number of trees removed is fixed by a certain percentage p. To sort the
trees and keep the best ones, the weighted balanced accuracy is used. It is
defined as an average between specificity and sensitivity with a weighting
factor α.

After pruning the worst trees, an additional update weights process can
optionally be performed. The weights computed using the out-of-bag samples
are updated with the training data D'_i of the current iteration.

The resulting fuzzy random forest is set as the current model, and it is 
taken as the new model to be used until a new set of cases is available, 
and a new update iteration starts. 

The errors of the updated fuzzy random forest model on the D'_i dataset
may also be retrieved and stored in E_{D_i}, as was done for the base model
with E_T, so they can be used in subsequent iterations.
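The pruning in step 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the exact form of the weighted balanced accuracy is not given in the text, so the convex combination α · sensitivity + (1 − α) · specificity is an assumption, as is applying the percentage p to the merged forest. The names `TreeStats`, `weighted_balanced_accuracy` and `update_forest` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class TreeStats:
    """Out-of-bag metrics of one fuzzy decision tree (hypothetical container)."""
    name: str
    sensitivity: float
    specificity: float


def weighted_balanced_accuracy(tree, alpha=0.5):
    # Assumed form: convex combination of sensitivity and specificity.
    # alpha = 0.5 reduces to the usual balanced accuracy.
    return alpha * tree.sensitivity + (1.0 - alpha) * tree.specificity


def update_forest(current_trees, new_trees, p, alpha=0.5):
    """Add the new trees, then drop the worst fraction p of the merged forest."""
    merged = list(current_trees) + list(new_trees)
    n_remove = math.floor(p * len(merged))  # assumption: p applies to the merged forest
    ranked = sorted(
        merged,
        key=lambda t: weighted_balanced_accuracy(t, alpha),
        reverse=True,
    )
    return ranked[: len(merged) - n_remove]


# Toy example with made-up metric values.
forest = [TreeStats("t1", 0.9, 0.8), TreeStats("t2", 0.5, 0.6)]
new = [TreeStats("n1", 0.7, 0.9), TreeStats("n2", 0.4, 0.95)]
kept = update_forest(forest, new, p=0.25, alpha=0.7)
```

With α above 0.5 the ranking favours sensitive trees, which matches the stated goal of improving sensitivity; the value 0.7 here is only illustrative.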


The use of error samples has two purposes: on the one hand, to increase
the size of the training set D'_i and, on the other hand, to show these
wrongly classified cases to the model again so that it can build new rules
that cover them appropriately.
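The three versions of the merge step can be sketched as below. This is an illustrative reading of the description, not the authors' code: the function name `build_update_set` and its arguments are hypothetical, the behaviour of the no errors version (using only the new batch) is inferred from its name, and whether duplicates are removed after merging is not stated in the text, so none are removed here.

```python
VERSIONS = ("no errors", "errors", "all errors")


def build_update_set(version, new_batch, prev_errors, prev_batches):
    """Return the dataset D'_i used to train the new trees at iteration i.

    version      -- one of "no errors", "errors", "all errors"
    new_batch    -- samples D_i gathered since the last update
    prev_errors  -- misclassified samples stored so far (E_T, then E_{D_i})
    prev_batches -- list of batches from earlier iterations
    """
    if version not in VERSIONS:
        raise ValueError(f"unknown version: {version!r}")
    merged = list(new_batch)
    if version in ("errors", "all errors"):
        # Show previously misclassified samples to the model again.
        merged += list(prev_errors)
    if version == "all errors":
        # Additionally reuse the training data of earlier iterations.
        for batch in prev_batches:
            merged += list(batch)
    return merged


# Toy example: samples are represented by string identifiers.
d_i = ["s5", "s6"]
errors = ["s1"]
earlier = [["s2", "s3"], ["s4"]]
sizes = {v: len(build_update_set(v, d_i, errors, earlier)) for v in VERSIONS}
```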


3 Experimental results 


Experiments are mainly done with the DR dataset obtained from the hospital,
which continuously grows with new patient visits. However, to validate the
proposed methods, we have also used the occupancy dataset [1] from the UCI
public repository, in which the occupancy of an office room is predicted.

The data from both problems is split into three different datasets: training,
validation and testing. The validation set is split into different batches to
simulate the new data that continuously arrives at the system, while the
testing set is used after each iteration to check the performance of the
updated FRF.
Fig. 2 shows the evolution of the updated model on the test set after each
iteration. The metrics on the DR dataset gradually improve over all
iterations. Moreover, sensitivity is the metric that increases the most, as
desired. In the occupancy results, the sensitivity gradually improves, and
the specificity ends up slightly lower. Although this is less desirable than
improving both metrics, it is still acceptable for our use case.


Fig. 2. Metrics on the test set. Diabetic Retinopathy (left) and Occupancy (right)


1 Abstract


Depth estimation is a challenging task in 3D reconstruction that improves the
accuracy of environment sensing and awareness. Recently, convolutional neural
networks (CNNs) have demonstrated an extraordinary ability to estimate
depth maps from monocular videos. However, traditional CNNs do not support
topological structure, and they can work only on regular image regions
with fixed size and weights. On the other hand, graph convolutional networks
(GCNs) can handle convolution on non-Euclidean data and can be applied to
irregular image regions within a topological structure. Therefore, to
preserve the geometric appearance and location of objects in the scene, in
this work we exploit GCNs for a self-supervised monocular depth estimation
model. Our model consists of two parallel auto-encoder networks. The first is
an auto-encoder that extracts features from the input image and relies on
a multi-scale GCN to estimate the depth map. The second network, based on
ResNet-18, estimates the ego-motion vector (i.e., 3D pose) between two
consecutive frames. The estimated 3D pose and depth map are then used to
reconstruct the target image.


2 Introduction 


In the Artificial Intelligence (AI) field, deep learning (DL) networks in
particular have achieved high performance in various depth estimation and
ego-motion prediction tasks, and their use is expanding rapidly. Depth
estimation is important as a driver for the adoption of modern technologies
in self-driving vehicles [1] and object distance prediction [7]. Besides,
depth maps can be used for underwater machine vision and robotic perception
[9].

The stereo vision system is one of the common techniques used for depth
estimation. However, in order to save cost and computational resources, many
methods have been presented to perform depth estimation based on a monocular
camera. Monocular depth estimation methods can be divided into two
categories in terms of the learning approach: supervised learning methods [3]
and unsupervised learning methods [4]. Most existing DL monocular depth
estimation networks use convolutional neural networks (CNNs) to extract
feature information. However, CNNs are limited, since they do not consider
the characteristics of the geometric depth information, object location and
contextual features in the scene. Besides, there has recently been a need to
extend deep neural models from the Euclidean domains handled by CNNs to
non-Euclidean domains [2]. Thus, the research community has started to
recognize the importance of DL networks based on graphs. The effectiveness
of the graph convolutional network (GCN) has been proved in processing graph
data for classification and segmentation tasks. Thus, in this work, we
propose a novel DL network architecture based on GCN that can help to
advance monocular depth estimation.


[Figure 1 panels: GCNDepth, FeatDepth, Monodepth2]


Fig. 1: Depth from a single image. GCNDepth (our self-supervised model)
produces high-quality depth maps with clear background and sharp edges
compared to state-of-the-art self-supervised depth estimation.


3 Summary 


Based on the brief survey above, and to avoid depending on ground truth
and obtain more generalizable monocular depth estimation, we propose a
self-supervised learning approach in this work. Our method estimates both
the depth images and the ego-motion to increase the constraints on depth
prediction. For monocular depth estimation, the relationship between object
location and visual and contextual features in the scene is significant for
preserving the objects’ boundaries. Most self-supervised monocular depth
estimation methods [5] are based on CNNs that extract visual appearance
features from whole scene images. However, in most cases, CNN-based networks
yield blurred object edges and boundaries. We use a standard CNN encoder
for visual feature extraction and a GCN decoder for reconstructing depth
maps. The reason for using a GCN as the decoder network is to improve the
detection of sharp boundaries and reduce the background noise, in order to
compute precise depth maps with full object details compared to the
self-supervised state-of-the-art model. For the CNN-based encoder, most
monocular depth estimation methods use the ResNet-50 network as a backbone
for feature extraction and achieve high performance. Thus, we similarly use
ResNet-50 for the depth estimation network in our encoder. For ego-motion
estimation, we use the same network proposed in [8], which is based on
ResNet-18 as a backbone. In order to obtain more structural details in the
scene, our approach uses a combination of different warping errors proposed
in the state of the art: the reconstruction error, which minimizes the
errors in the reconstructed image; the photometric reprojection error
proposed in [5], which optimizes the values that provide matching pixel
intensities between the target and reconstructed images; and, finally, a
combination of discriminative and curvature errors, which highlights the
geometric characteristics of the objects and textured regions in the scene
image.


Fig. 2: Schematic illustration of the whole framework


4 Overall Pipeline


The proposed method consists of two main networks. The first network, called
DepthNet, takes the source image as input and outputs the depth map. The
second network is PoseNet, a pose predictor that estimates the ego-motion
vector between the source and the target images (in our case, consecutive
images). The output of PoseNet is the relative pose between the source and
target images. These two main networks provide the geometric information
needed to establish point-to-point correspondences for the reconstructed
image. The whole architecture of our model is illustrated in Fig. 2.
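To make the point-to-point correspondence concrete, the sketch below maps one target pixel into the source image using a predicted depth and a relative pose. It assumes a standard pinhole camera model, which the text does not state explicitly; the function names and the intrinsic values (`fx`, `fy`, `cx`, `cy`) are hypothetical and chosen only for illustration.

```python
def mat_vec(matrix, vector):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(matrix[r][c] * vector[c] for c in range(3)) for r in range(3)]


def reproject_pixel(u, v, depth, fx, fy, cx, cy, rotation, translation):
    """Map target pixel (u, v) with known depth to source image coordinates.

    rotation (3x3) and translation (3-vector) describe the relative pose from
    the target camera frame to the source camera frame.
    """
    # Back-project the pixel to a 3D point in the target camera frame.
    point = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Move the point into the source camera frame.
    moved = [a + b for a, b in zip(mat_vec(rotation, point), translation)]
    # Project with the pinhole model (requires positive depth after the move).
    return fx * moved[0] / moved[2] + cx, fy * moved[1] / moved[2] + cy


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical intrinsics and a pure sideways translation of 0.5 units.
same = reproject_pixel(100.0, 50.0, 10.0, 700.0, 700.0, 320.0, 120.0,
                       IDENTITY, [0.0, 0.0, 0.0])
shifted = reproject_pixel(100.0, 50.0, 10.0, 700.0, 700.0, 320.0, 120.0,
                          IDENTITY, [0.5, 0.0, 0.0])
```

Sampling the source image at the reprojected coordinates for every target pixel yields the reconstructed image that the warping errors compare against the target.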


5 Discussion 


The performance of our model compared with the state-of-the-art solutions is
summarized in Table 1. As shown in Table 1, the GCNDepth method achieved
the highest performance in terms of the Abs-Rel and Sq-Rel errors and the
second and third accuracy metrics (δ2, δ3). In addition, the proposed
method achieved the second-best results in RMSE, RMSE-Log and the first
accuracy metric (δ1), with a slight difference of 0.003 in RMSE-Log and 0.5%
in δ1 compared to the highest results, achieved by [3]. In general, the
FeatDepth model and our model, GCNDepth, provided comparable results, and
they outperform the other tested methods.

Although the FeatDepth model achieved similar results to our model, the
GCNDepth model yields a 40% reduction in the number of trainable parameters
compared to the FeatDepth model: GCNDepth has 48,220,954 trainable
parameters, whereas FeatDepth has 79,681,406. This is because the FeatDepth
model has an extra deep feature network for feature representation learning
to cope with the geometry problem of self-supervised depth estimation. The
comparable results show that the use of a GCN in reconstructing the depth
images can improve the photometric error that appears in the
self-supervision problem without using the feature network proposed in [3].
The results shown in Table 1 support that the use of a GCN in estimating
depth maps from a monocular video can yield depth maps that outperform or
match the state of the art on the KITTI dataset.
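The reported reduction can be checked directly from the two parameter counts quoted above; the short calculation below is only a worked example of that arithmetic, with hypothetical variable names.

```python
# Trainable parameter counts quoted in the text.
gcndepth_params = 48_220_954
featdepth_params = 79_681_406

# Relative reduction of GCNDepth with respect to FeatDepth.
reduction = 1.0 - gcndepth_params / featdepth_params
print(f"Parameter reduction: {reduction:.1%}")
```

The result is a little under 40%, consistent with the rounded figure given in the text.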


Table 1: Comparison of different methods on the KITTI dataset. Best results
are in bold blue and second best results are in bold red color.


                  Lower is better                     Higher is better
Method            Abs-Rel  Sq-Rel  RMSE   RMSE-Log    δ1     δ2     δ3

Monodepth2 [5]    0.115    0.882   4.701  0.190       0.879  0.961  0.982
FeatDepth [3]     0.104    0.729   4.481  0.179       0.893  0.965  0.984
GCNDepth          0.104    0.720   4.494  0.181       0.888  0.965  0.984
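For readers unfamiliar with the column headers, the sketch below computes the error and accuracy measures as they are commonly defined for monocular depth evaluation. The definitions are the usual ones from the depth estimation literature rather than something spelled out in this abstract, and the function name `depth_metrics` and the sample values are hypothetical. Plain Python lists are used to keep the example dependency-free; a real evaluation would operate on arrays of valid ground-truth pixels.

```python
import math


def depth_metrics(gt, pred):
    """Compute standard depth errors and threshold accuracies.

    gt, pred -- equal-length sequences of positive depth values.
    """
    pairs = list(zip(gt, pred))
    n = len(pairs)
    abs_rel = sum(abs(g - p) / g for g, p in pairs) / n
    sq_rel = sum((g - p) ** 2 / g for g, p in pairs) / n
    rmse = math.sqrt(sum((g - p) ** 2 for g, p in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(g) - math.log(p)) ** 2 for g, p in pairs) / n
    )
    # delta_k: fraction of pixels whose ratio max(g/p, p/g) is below 1.25^k.
    ratios = [max(g / p, p / g) for g, p in pairs]
    deltas = [sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)]
    return {
        "abs_rel": abs_rel,
        "sq_rel": sq_rel,
        "rmse": rmse,
        "rmse_log": rmse_log,
        "delta1": deltas[0],
        "delta2": deltas[1],
        "delta3": deltas[2],
    }


# Made-up depths in metres, for illustration only.
metrics = depth_metrics([2.0, 4.0], [2.0, 5.2])
```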


1 Introduction


The rapid advancement and development of new digital technologies has
changed the dynamics of our daily lives by providing us with new services
and products. Among the services, we can highlight social connectivity,
information storage and location (GPS and mapping). Regarding the products,
smartwatches and smart home devices with an Intelligent Personal Assistant
(IPA) are two examples that are becoming more popular worldwide. These
services and devices generate huge amounts of information, which is processed
by the Service Providers (SPs) in order to improve and develop new products.
We cannot ignore, however, that this information also represents an
important source of revenue for the SPs. As a result, their products can be
cheaper or even free [1].

The processing of the aforementioned information may result in the
extraction of sensitive information, which can jeopardise the users’
privacy. Thus, the EU took a decisive step with the General Data Protection
Regulation (GDPR) [2], which came into effect in May 2018, in order to
protect users’ rights. The GDPR aims to mitigate the abuse of massive
collection and processing of users’ personal data. The regulation guarantees
specific privacy rights to Data Subjects (physical or legal entities to
which the personal data belongs), ensuring that personal data “can only be
gathered legally, under strict conditions, for a legitimate purpose”, as
well as bringing full control back to the data owners.

Under the GDPR, companies are required to prove compliance in case of
suspicion of a violation or when a Data Subject (DS) lodges a complaint with
the Supervisory Authority (SA). However, the legislative text does not
specify how to transparently demonstrate that the information collected and
its processing comply with the regulation. Likewise, DSs have no tools to
know transparently and easily which data is being collected and processed,
and for which purposes, or to control what happens with their personal data.
As a result, DSs are mostly limited to giving their consent beforehand, on
the basis of an abstract clause. In this regard, the current GDPR-compliance
verification architectures generally depend on each service provider, i.e.
they are specific and centralized for each of them. For this reason,
critical concerns about the lack of transparency have been raised [3].

It is therefore necessary to deploy a framework that enables the
verification of agreements between the users (DSs), Data Controllers (DCs)
and Data Processors (DPs) in relation to data custody and processing. At the
same time, the users should be able to know and control which data is being
collected, who is processing it and for which purposes. From the DS point
of view, the main benefit is a way to manage their personal data that does
not depend on the DC, i.e. the tool can be used to manage all agreements
with SPs. In addition, from a DC and DP point of view, the main benefit is
a proof that can be presented to SAs showing that data was obtained and
processed in a GDPR-compliant way. This proof should have the following
properties: i) publicly accessible; ii) verifiable; iii) authentic; iv)
immutable; and v) non-repudiable. In view of these properties, some authors
have proposed the use of Smart Contracts (SCs) implemented over blockchain
technology (BC) as a general-purpose data management solution [4-10]. This
is a promising technology for GDPR-compliant personal data management.
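Two of the listed properties, verifiability and immutability, can be illustrated with a simple hash-chained log of agreement records. The sketch below is not a blockchain or a smart contract and is not the system proposed by the authors; it only shows why chaining each record to the hash of the previous one makes later tampering detectable. Authenticity and non-repudiation would additionally require digital signatures, which are omitted. All names (`append_agreement`, `verify_chain`, the record fields) are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first record


def record_hash(body):
    """Hash a record body deterministically (keys sorted before encoding)."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_agreement(chain, subject_id, controller_id, purposes):
    """Append an agreement that is linked to the previous record's hash."""
    prev = chain[-1]["hash"] if chain else GENESIS
    body = {
        "subject": subject_id,
        "controller": controller_id,
        "purposes": sorted(purposes),
        "prev": prev,
    }
    chain.append({**body, "hash": record_hash(body)})


def verify_chain(chain):
    """Return True if every record is intact and correctly linked."""
    prev = GENESIS
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev"] != prev or record_hash(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


log = []
append_agreement(log, "ds-key-1", "controller-A", ["billing", "analytics"])
append_agreement(log, "ds-key-2", "controller-B", ["delivery"])
valid_before = verify_chain(log)

# Tampering with an earlier record invalidates the chain.
log[0]["purposes"] = ["marketing"]
valid_after = verify_chain(log)
```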


2 Contributions 


Our first contribution to this topic is a lightweight blockchain-based
GDPR-compliant personal data management system, which provides publicly
accessible, immutable evidence of the agreements between a Data Subject and
a Service Provider about the DS’s personal data. Compared to other existing
research works, our work proposes a new conceptual design and system
architecture for human-centric personal data management, using BC and SC
technology in compliance with the GDPR (see Figure 1). Our work
differentiates between the data collection and data processing concepts by
identifying the Controller and Processor actors and treating them in a
related but separate way. We also try to reduce the overhead on DSs: if they
needed extensive knowledge of BC technology, or had to operate constantly on
our platform, the system would be unlikely to be accepted and used by the
community.

In the current work we are extending the preliminary scheme presented
in [11]. In particular, the new scheme has been partially redesigned to be
deployed more conveniently in a realistic setting. This includes: i) a
modification in the process flow that makes the DS primarily responsible for
their own personal data and the initiator of the whole proposed protocol;
ii) the use of the well-known XACML framework to improve the robustness of
the access control process for the DSs’ collected data; and iii) a
refinement in the use of the SCs that allows us to include all the purposes
of a certain DP in a single contract, thus improving the general efficiency
of the proposed system. Moreover, in the new proposal, we have carried out a
more detailed experimental study, including the implementation of a
realistic use case.


Fig. 1. System Architecture


3 Future Work


Every actor needs an asymmetric key pair to use the proposed system, as
digital signatures are used to interact with Smart Contracts. The public key
(PK) can be seen as an ID of the actor itself, so in order to preserve DSs’
anonymity against possible linkage attacks, a new key pair is used for every
consent with a different Data Controller.

In order to hide the management of asymmetric keys from DSs, as future
work we intend to complement our work with a tool that allows them to
generate, store and manage all asymmetric keys used to interact with the
proposed system in a transparent way. This tool must be multi-platform,
secure and tamper-resistant, as it holds all the IDs (PKs) a Data Subject
uses in our system and their associated secret keys to interact with the
generated agreements.


1 Introduction


Early diagnosis of retinal lesions helps reduce the risk of visual loss and
blindness. Ophthalmologists inspect eye fundus images to detect the signs of
common eye diseases like diabetic retinopathy (DR) and glaucoma. Figure 1
shows the most common types of lesions that may affect the retina. The
yellow spots in the retina region are hard exudates (HX), pale yellow or
white areas with ill-defined edges are soft exudates (SX), tiny outpouchings
of blood are microaneurysms, and the bleeding that occurs in the retina is
known as haemorrhages.


[Figure 1 panels: Microaneurysms, Soft Exudates, Hemorrhages, Hard Exudates]


Fig. 1. Retinal lesions types. 


Indeed, ophthalmologists dedicate many hours to the manual analysis of
hundreds of fundus images, which represents a high cost considering the
manpower needed and salaries [1], [5]. On the other hand, artificial
intelligence-based computer-aided diagnosis (CAD) systems, if trained
properly, can analyze hundreds of fundus images and provide a diagnosis
comparable to that of experienced ophthalmologists [2].

In this research, we leverage emerging deep learning technologies such as 
U-Net and Fully Convolutional Network (FCN) to automatically detect and 
segment various lesions in eye fundus images. 


2 Methodology 


In this study, we use a deep learning-based model termed gated skip con- 
nections [4] to distinguish and segment hard and soft exudates properly in 
fundus images. The model comprises five encoder blocks with two convolu- 
tional layers and five decoder blocks with four convolutional layers (Figure 2). 
In this model, an efficient skip connections technique is combined with the 
U-Net architecture’s decoder to retrieve eye-lesion-relevant information while 
disregarding irrelevant features. 


Fig. 2. Eye lesion segmentation using deep gated skip connections.


The Indian Diabetic Retinopathy Image Dataset (IDRiD) [3] is used to
train and evaluate the eye lesion segmentation algorithm. IDRiD provides
81 fundus images of the retina with excellent annotations for the optic disc,
microaneurysms, hemorrhages, and hard and soft exudates. The fundus images
are 4288 × 2848 pixels in size. The database is divided into two standard
sections for training and testing: 54 images for training and 27 images for
testing. It should be mentioned that a single image may depict several
different types of lesions. To benefit from the high resolution of the
images while minimizing computing costs, we split each image into 12 patches
during the training and testing phases. Data augmentation techniques are
used to increase the number of training images. The model is trained using
the binary cross-entropy loss function and the Adam optimizer.

Two segmentation models for eye lesions have been trained: one of them 
for hard exudates and another for soft exudates. 
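The splitting of each image into 12 parts can be sketched as follows. The text does not say how the 12 parts are arranged, so the 3 × 4 grid used here is an assumption, and the function names are hypothetical. Because 2848 is not divisible by 3, the rows are split into nearly equal bands rather than identical ones.

```python
def split_bounds(length, parts):
    """Split range(length) into `parts` contiguous, nearly equal intervals."""
    base, extra = divmod(length, parts)
    bounds, start = [], 0
    for i in range(parts):
        # The first `extra` intervals get one additional element.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def patch_boxes(height, width, rows=3, cols=4):
    """Return (top, bottom, left, right) boxes covering the image exactly once."""
    return [
        (top, bottom, left, right)
        for top, bottom in split_bounds(height, rows)
        for left, right in split_bounds(width, cols)
    ]


# IDRiD image size from the text: 4288 x 2848 pixels (width x height).
boxes = patch_boxes(height=2848, width=4288)
covered = sum((b - t) * (r - l) for t, b, l, r in boxes)
```

Each box can then be used to crop both the fundus image and its annotation mask, so that the patches and their labels stay aligned.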


3 Preliminary results and future work 


Table 1 presents the results of the two segmentation models and the
state-of-the-art models. On the hard exudates segmentation task (HX), the
proposed eye lesion segmentation model obtains an F1-score and an area under
the precision-recall curve (AUPR) of 75.9% and 84.8%, respectively. According
to Table 1, this is 6.9 points better in F1-score and 0.7 points better in
AUPR than the method proposed by Xiao et al. [6]. On the soft exudates
segmentation task (SX), it achieves an F1-score and AUPR of 68.7% and 75.0%,
respectively. As one can see, on this task the proposed model outperforms
HEDNet+cGAN [6] and Saha et al. [7].


Table 1. Preliminary results and comparison


Lesion          Method             F1 (%)   AUPR (%)
Hard exudates   HEDNet+cGAN [6]    69.0     84.1
                Saha et al. [7]    -        87.0
                Proposed           75.9     84.8
Soft exudates   HEDNet+cGAN [6]    44.0     48.4
                Saha et al. [7]    -        71.0
                Proposed           68.7     75.0


Figure 3 shows a sample of results for HX (top) and SX (bottom). The masks
predicted by the segmentation models are close to the ground truth.


[Figure 3 columns: Input Image, Actual Mask, Predicted Mask]


Fig. 3. Segmentation results. 


Future work will include using meta-learning techniques based on deep
learning to improve the segmentation results, and performing segmentation
of the other types of eye lesions, such as microaneurysms and haemorrhages.


1 Abstract


Fundus image quality is critical to diagnosing retinal diseases, since image
clarity is significant in classifying such images. This work presents a new
field-friendly multitask framework for automatically interpreting fundus
image quality, based on an autoencoder network used to reconstruct the input
image. The proposed system provides an interpretable quality assessment and
quality visualization. In particular, the present approach can detect the
optic disc and other structures as features that support the evaluation
during encoding. The experimental results show the superiority of the
proposed approach over various recent methods.


2 Introduction 


Modern medicine draws on big data to assess fundus image quality in line
with the human visual system. In ophthalmology, the use of fundus
photography has grown, giving rise to indispensable applications of
portable fundus cameras. However, in fundus photography, image quality is
susceptible to general quality distortions, such as color distortion,
uneven lighting, low contrast, and blurring. Digital fundus imaging is used
to diagnose various eye disorders such as diabetic retinopathy (DR) [1],
cataract [2], age-related macular degeneration (AMD) [3], and glaucoma [4].

Scientists focus on ways to provide effective medical help for a large
number of patients. However, the number of available eye specialists fails
to meet the current demand. To address the shortage of ophthalmologists,
telemedicine [5] and computer-aided diagnosis (CAD) [6] can be used for the
diagnosis and prognosis of eye diseases.

All CAD systems for eye disease diagnosis depend on the quality of retinal
images, and low-resolution images degrade their decision-making performance.
Thus, a trustworthy assessment of retinal fundus image quality is needed to
improve the early detection of eye diseases. In this work, we propose a
framework based on deep learning techniques, mainly a deep autoencoder
network, to develop a reliable fundus image quality assessment. The model
consists of two cascaded networks: an autoencoder network for
self-supervision based on image reconstruction and a deep CNN classifier
for classifying the quality of the input image. In the autoencoder network,
a multi-layer encoder is used to extract local and global features related
to the quality of retinal images, which are then decoded to reconstruct the
same input image. The features obtained from the encoder network are then
fed to the classifier to classify the quality of the input images.

After training the models, we analyze their representations via attribution
and other interpretability methods. Our contributions in this paper are as
follows:


• We propose an autoencoder network to correctly recognize the
representative deep features of fundus images via the encoder network.
The decoder part is used to reconstruct the input fundus image.

• We suggest a CNN classifier fed with the features learned by the encoder
network to classify input fundus images as gradable or ungradable.

• We suggest using the mean squared error (MSE) as a loss function to train
the autoencoder network. The MSE loss function averages the squared
differences between the input image and the image reconstructed by the
decoder. Also, we use the binary cross-entropy loss function to train the
CNN classifier.

• We propose to integrate the losses of the two networks, the autoencoder
and the CNN classifier, into a single learning framework to solve the fundus
image gradability problem.

• We apply feature attribution and other interpretability methods to
understand the representation of the fundus images in both models.

• Our interpretability analysis indicates that the autoencoder loss helps
the classifier focus more on the relevant structures of the fundus images,
such as the fovea, optic disc, and main blood vessels. The normal model, on
the other hand, uses more arbitrary input regions to determine the
gradability of the image.
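The combination of the two losses described in the contributions can be sketched as below. The text does not say how the reconstruction and classification losses are weighted, so the weighting factor `lam` and the simple weighted sum are assumptions, and all function names are hypothetical. Plain Python lists stand in for image pixels and predicted probabilities; an actual implementation would use a deep learning framework.

```python
import math


def mse_loss(x, x_hat):
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def bce_loss(labels, probs, eps=1e-7):
    """Binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def joint_loss(x, x_hat, labels, probs, lam=1.0):
    # Assumed form: reconstruction loss plus a weighted classification loss.
    return mse_loss(x, x_hat) + lam * bce_loss(labels, probs)


# Toy values: two "pixels" and one gradability label (1 = gradable, assumed).
loss = joint_loss([0.0, 1.0], [0.0, 0.0], [1], [0.5], lam=0.5)
```

Training both networks against a single scalar of this kind is what lets the reconstruction objective shape the encoder features that the classifier consumes.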


3 Proposed Model 


Figures 1 and 2 give a high-level overview of the training and testing
phases of our proposed model. In the training phase, we use an autoencoder
network consisting of two serial networks: the encoder and the decoder. The
encoder network extracts the high-level features of the input fundus images.
After feature extraction, these features are fed to the decoder network so
as to rebuild the same input image. In the next stage, another network, the
classifier network, is fed with the features obtained from the autoencoder
network to classify the retinal image quality into two categories: gradable
and ungradable. The input images are resized to 480 × 480. In the testing
phase, we use only the trained encoder and classifier networks to classify
the quality of the fundus image, which is additionally passed to the
interpretation phase, an important phase when testing medical images and
classifying them as gradable or ungradable.


Fig. 1. General overview of the autoencoder model in the training stage.


Fig. 2. General overview of the autoencoder model in the testing stage.


4 Results


Tables 1 and 2 show the results (i.e., accuracy, sensitivity, specificity,
precision and F1 score) of MCF-Net and four variations of the proposed
system, one per loss function, on the two datasets. As shown in Table 1,
the proposed model and its variations outperformed MCF-Net on most of the
five evaluation metrics. Among them, our model with MSE as a loss function
yielded the best results, with an F1 score, sensitivity and specificity of
0.88, 0.83 and 0.91, respectively. For instance, our model with MSE yielded
an improvement of 8% in F1 score compared to MCF-Net. In turn, as shown in
Table 2 for the second dataset, EyeQ, the proposed model and its variations
also outperformed MCF-Net. Our model with MSE achieved significant
improvements of 16%, 10% and 38% in F1 score, precision and specificity,
respectively, as well as a small improvement of around 1% in sensitivity.


Table 1. Comparison between the proposed model and MCF-Net [7] on the
EyePACS dataset [8]


Model                       Accuracy  Sensitivity  Specificity  Precision  F1-Score
MCF-Net Model               0.81      0.64         0.95         0.84       0.80
Our Model - SSIM Loss       0.815     0.95         0.65         0.84       0.82
Our Model - MS-SSIM Loss    0.86      0.94         0.76         0.87       0.86
Our Model - MAE Loss        0.85      0.84         0.86         0.85       0.85
Our Model - MSE Loss        0.875     0.83         0.91         0.88       0.88


Table 2. Comparison of the proposed model and the state of the art on the
EyeQ dataset [7]


Model                       Accuracy  Sensitivity  Specificity  Precision  F1-Score
MCF-Net Model               0.865     0.946        0.51         0.80       0.75
Our Model - SSIM Loss       0.93      0.94         0.90         0.88       0.90
Our Model - MS-SSIM Loss    0.935     0.93         0.91         0.94       0.93
Our Model - MAE Loss        0.94      0.95         0.88         0.90       0.91
Our Model - MSE Loss        0.942     0.954        0.89         0.90       0.91
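The five measures reported in Tables 1 and 2 follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions; which class is treated as positive (gradable or ungradable) is not stated in the text, the function name `quality_metrics` is hypothetical, and the counts in the example are made up for illustration and are not taken from the experiments.

```python
def quality_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
scores = quality_metrics(tp=8, fp=1, tn=9, fn=2)
```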


5 Interpretation of Model Features 


The proposed model, shown in Figure 3, interprets fundus images by providing
scores with explainability, to help doctors and medical care workers
distinguish gradable from ungradable images based on grades and
interpretations. These were adopted after being sent to the General Hospital
in Tarragona, Spain, and presented to a group of experts to confirm the
results of the model classification, which proved to be successful and
superior to the normal model.


Fig. 3. General overview of the autoencoder model with explanation.


We use various interpretability methods to understand the Normal and Au- 
toencoder models and compare their internal representations. Our approach 
employs: 


• Saliency map methods such as GradCAM visualizations to understand the
relevance of the input regions [9].


6 Conclusions and future work 


In this paper, we proposed a supervised deep learning model based on an
autoencoder network. The autoencoder network is able to reconstruct the
input fundus images in order to correctly identify the visual features
related to image quality. Our model also includes a classifier, fed with
features extracted by the encoder network, to classify the quality of the
retinal image as gradable or ungradable. In addition, through the
interpretability analysis, we show that the gradability models mainly focus
on the presence and type of blood vessels in the fundus image. Other key
structures such as the optic disc and macula seem to play a lesser role than
expected. Finally, via this analysis, we also found that the addition of the
decoder and corresponding loss helps the proposed model focus more on
relevant structures of the fundus image.


1 Abstract


The use of magnetic resonance imaging (MRI) in prostate segmentation, diag- 
nosis, and treatment is critical. Using MRI images of all modalities, computer- 
aided diagnostic (CAD) systems based on machine learning can assist doctors 
in detecting prostate cancer and its aggressiveness at an earlier stage. One 
of the most important stages of CAD systems is automatic prostate gland 
delineation. Deep learning has recently shown encouraging segmentation
results on medical images. In this paper, we examine the current state of
the art of deep learning-based techniques for prostate segmentation in MRI
images and explain their benefits and shortcomings. In addition, we present
a new approach for classifying prostate biopsy malignancies in MRI images,
in which we leverage the segmentation results to extract deep radiomics
features from MRI prostate images.


2 Introduction 


Prostate cancer is the most prevalent malignant tumor in men worldwide. In
the diagnosis and treatment of prostate cancer, accurate detection of the
prostate gland in medical scans is critical. Deep learning-based algorithms
have achieved significant success in a variety of domains, including
computer vision, natural language processing, and medical imaging
diagnosis. Their potential for medical image segmentation is still being
investigated in the literature, and reliable results for automated prostate
detection remain difficult to obtain.

The main goal of this study is to compare current deep learning-based
approaches for prostate cancer detection in MRI scans, highlighting the
advantages and disadvantages of each segmentation model. Segmentation
results on three public datasets are presented: Promise12, ISBI Challenge
2013, and ProstateX. To evaluate the performance of the prostate cancer
detection algorithms, we employ the Dice coefficient and the Hausdorff
distance. Our novel contribution is the use of deep radiomics features
acquired from MRI images to distinguish benign from malignant prostate
tumors.


3 Methodology 


An overview of the proposed methodology [1] is depicted in Figure 1. We
trained deep learning models to segment the prostate in MRI scans. The
models are trained on public datasets: Promise12 [5], ISBI2013 [7], and
ProstateX [2]. Our strategy relies on both 2D and 3D segmentation models. We
employed the U-Net [9] and 2D-Unet [3] models for two-dimensional
segmentation, and trained 3D-FCN [8], 3D-Unet [4], and MS-Net [6] for 3D
segmentation.

We split the data in our experiments into training and testing sets. To
increase the amount of training data, we used data augmentation. To evaluate
the models’ performance, we compute the Dice coefficient and the Hausdorff
distance. The models discussed above were also tested on the ProstateX [2]
dataset. This stage lays the groundwork for our ultimate goal, which is to
extract deep learning-based radiomics to classify prostate malignancies.

Fig. 1. The schematic illustration of the proposed methodology.

We used the Promise12 and ISBI2013 datasets to train multiple segmentation
models in order to assess their performance. The segmentation results in
terms of the Dice coefficient and Hausdorff distance are presented in Tables
1 and 2.
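As an illustration of the two evaluation measures, the following is a minimal sketch and not the evaluation code used in this study. It assumes binary masks stored as nested lists of 0/1, computes the Hausdorff distance by brute force over all foreground pixels, and uses a hypothetical `spacing` parameter to convert pixel units to millimetres.

```python
import math


def dice_coefficient(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal shape."""
    inter = sum(p and t for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, truth))
    # Two empty masks are treated as a perfect match (an assumption).
    return 1.0 if total == 0 else 2.0 * inter / total


def foreground_points(mask):
    """Return the (row, column) coordinates of all non-zero pixels."""
    return [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]


def hausdorff_distance(pred, truth, spacing=1.0):
    """Symmetric Hausdorff distance between the foreground pixel sets.

    `spacing` is a hypothetical isotropic pixel size used to express the
    result in physical units such as millimetres.
    """
    a, b = foreground_points(pred), foreground_points(truth)
    if not a or not b:
        raise ValueError("both masks need at least one foreground pixel")

    def directed(src, dst):
        # Largest distance from a point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return spacing * max(directed(a, b), directed(b, a))


pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
truth = [[0, 1, 0],
         [0, 1, 0],
         [0, 1, 0]]
print(dice_coefficient(pred, truth), hausdorff_distance(pred, truth))
```

The brute-force search is quadratic in the number of foreground pixels, which is acceptable for small masks; volumetric data would normally call for a distance-transform-based implementation.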


Table 1. Comparing the performance of the segmentation models using the
ISBI2013 dataset.


Model       | DSC ± std     | HD (mm) ± std
MS-Net [6]  | 0.899 ± 1.960 | 9.511 ± 4.011
2D-Unet [3] | 0.901 ± 0.015 | 6.030 ± 3.082
3D-Unet [4] | 0.722 ± 0.020 | 17.761 ± 2.924


Table 2. Comparing the performance of the segmentation models using the
Promise12 dataset.


Model       | DSC ± std     | HD (mm) ± std
U-Net [9]   | 0.880 ± 0.041 | 17.690 ± 2.087
3D-FCN [8]  | 0.790 ± 0.050 | 12.910 ± 4.005
2D-Unet [3] | 0.899 ± 0.021 | 7.661 ± 3.924


On both the ISBI2013 and Promise12 datasets, the 2D-Unet model delivers the
best Dice coefficient, as shown in Tables 1 and 2. It also achieves the
lowest Hausdorff distance (HD) in both tables: 6.03 mm on ISBI2013 and
7.66 mm on Promise12. The 2D-Unet model also produces the most accurate
segmentation results on the ProstateX dataset, as shown in Table 3.


Table 3. Segmentation results on the ProstateX dataset.


Model       | DSC ± std     | HD (mm) ± std
U-Net [9]   | 0.791 ± 0.151 | 17.020 ± 2.884
3D-UNet [4] | 0.701 ± 0.078 | 18.001 ± 3.108
3D-FCN [8]  | 0.721 ± 0.047 | 13.411 ± 5.264
2D-UNet [3] | 0.898 ± 0.051 | 7.690 ± 2.987


5 Conclusion 


We have reported a comparative study of state-of-the-art deep learning-based
segmentation algorithms for prostate cancer in MRI images. Two measures, the
Dice coefficient and the Hausdorff distance, were used to evaluate the
performance of the models. As a next step, deep radiomics features will be
extracted and fed into a classifier to distinguish between prostate cancer
groups (e.g., benign or malignant).
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1 Introduction 


Breast cancer is one of the leading causes of cancer-related death for women
worldwide and poses a growing health problem, so diagnosing it at an early
stage is an urgent need. In recent decades, computer-aided diagnosis (CAD)
systems have been introduced to help physicians. Such systems not only
create and analyze images but also act as assistants that help doctors with
their interpretation. Deep learning methods, especially convolutional neural
networks (CNNs), have been successfully applied to lesion segmentation in
breast ultrasound (BUS) images. In our research, we employ state-of-the-art
deep learning-based semantic segmentation for breast tumor segmentation in
ultrasound images. Examples of benign and malignant tumors are shown in
Fig. 1.


Fig. 1. Benign tumor and malignant tumor.


* PhD advisors: Dr. Doménec Savi Puig Valls, Dr. Mohamed AbdelNasser, and Dr. Miguel
Angel


2 Nadeem Issam Zaidkilani 
2 Problem Statement 


Semantic image segmentation, which assigns per-pixel predictions of object
categories for a given image, is a fundamental problem in computer vision.
In recent years, many methods have achieved impressive results on this
problem. However, collecting labeled data for semantic segmentation is
expensive and time-consuming, as it requires dense pixel-level annotations.
While deep CNN-based semantic segmentation approaches achieve impressive
results when trained on large amounts of labeled data, their performance
drops significantly as the amount of labeled data decreases [1].

Many deep learning architectures have been proposed to solve the
segmentation problem in medical images, such as FCN, SegNet, UNET, and GAN
[2]. The UNET architecture [3] is the state of the art in medical image
segmentation. UNET was the first architecture designed especially for
medical image segmentation, and it achieved good results on small datasets.
UNET has an encoder-decoder structure, which reduces the spatial dimension
to extract features and then uses up-sampling to recover the spatial extent.
It also uses skip connections to preserve spatial information, which helps
improve the segmentation. The encoder-decoder architecture [3] is a neural
network structure based on improvements to FCN. It is mainly composed of two
parts: the encoder captures deep semantic information through several
down-sampling steps, and the decoder gradually restores the spatial and
detail information of the input image through several up-sampling
operations. Recently, many deep learning-based models, especially the fully
convolutional network (FCN) [4] and U-Net, have been successfully applied to
breast tumor segmentation in BUS images and have achieved outstanding
performance. For instance, Yap et al. [5] developed several FCN-based
variants for the semantic segmentation of breast lesions in BUS images. With
a dataset of 113 malignant and 356 benign BUS images, they achieved a Dice
score of 0.7626 on benign lesions and 0.54 on malignant lesions. Almajalid
et al. [6] modified and improved U-Net for lesion segmentation based on
contrast-enhanced and speckle-reduced BUS images. With a dataset of 221 BUS
images, they achieved a Dice score of 0.825.

However, breast tumor segmentation in BUS images remains an open problem due
to the poor image quality and the large variations in the sizes, shapes, and
locations of breast lesions. In our research, we used different semantic
segmentation architectures to segment breast tumors in BUS images, and we
compared the performance of different loss functions with different semantic
segmentation models.
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3 Methodology 


In our research, we developed breast tumor segmentation models based on two
deep CNN architectures, UNET and RESUNET, trained with different loss
functions. Specifically, we compared several loss functions with the UNET
model to determine which is most suitable and effective for our dataset. The
dataset used in this research was provided by the UDIAT Diagnostic Centre of
Sabadell, Spain, and consists of 163 images with their ground truth masks.
We divided the dataset into 113 training images and 50 test images. For
training, we used a batch size of 100, the Adam optimizer with a learning
rate of 0.0001, and 20 epochs. Standard data augmentation techniques
(rotation, width shift, height shift, shear, zoom, and horizontal flip) were
applied to the training set, increasing it from 113 to 2260 images, a
20-fold increase. The loss functions used in our experiments are defined as
follows:


1. Cross entropy: a measure of the difference between two probability
distributions for a given random variable or set of events.


$$L_{BCE}(y, \hat{y}) = -\left(y \log(\hat{y}) + (1 - y)\log(1 - \hat{y})\right) \qquad (1)$$


2. Dice coefficient loss: a measure of the overlap between the predicted
sample and the target sample; it is used for binary data.


$$L_{Dice} = 1 - \frac{2\,|A \cap B|}{|A| + |B|} \qquad (2)$$


3. Focal loss: an improved version of the cross-entropy (CE) loss. It
down-weights the contribution of easy examples and enables the model to
focus more on learning hard examples. It works well for highly imbalanced
class scenarios. A modulating factor $(1 - p_t)^{\gamma}$ is added to the
cross-entropy loss, with a tunable focusing parameter $\gamma \geq 0$. The
focal loss can be defined as


$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \qquad (3)$$


4. Tversky loss: a generalization of the Dice coefficient. It adds weights
to FP (false positives) and FN (false negatives).


5. Boundary with Dice: the boundary loss takes the form of a distance
metric on the space of contours, not regions. This can mitigate the
difficulties of highly unbalanced problems because it uses integrals over
the interface between regions instead of unbalanced integrals over the
regions.
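The first four losses can be sketched in a few lines of plain Python. This is a minimal illustration rather than the training code used in our experiments: it operates on flat lists of per-pixel labels and predicted probabilities, uses the soft (probability-weighted) forms of the overlap terms, and assumes the common convention in which the Tversky `alpha` weights false positives and `beta` weights false negatives. The default hyperparameters mirror the values reported in Table 1; the helper names are hypothetical.

```python
import math

EPS = 1e-7  # keeps log() finite for probabilities of exactly 0 or 1


def _clip(p):
    return min(max(p, EPS), 1.0 - EPS)


def bce_loss(y_true, y_pred):
    """Mean binary cross-entropy over all pixels."""
    terms = [-(y * math.log(_clip(p)) + (1 - y) * math.log(1 - _clip(p)))
             for y, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)


def dice_loss(y_true, y_pred, smooth=1e-6):
    """1 - 2|A ∩ B| / (|A| + |B|), with soft counts and a smoothing term."""
    inter = sum(y * p for y, p in zip(y_true, y_pred))
    return 1.0 - (2.0 * inter + smooth) / (sum(y_true) + sum(y_pred) + smooth)


def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0):
    """Mean of -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = _clip(p)
        p_t = p if y == 1 else 1.0 - p            # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)


def tversky_loss(y_true, y_pred, alpha=0.3, beta=0.7, smooth=1e-6):
    """1 - TP / (TP + alpha * FP + beta * FN); alpha = beta = 0.5 gives Dice."""
    tp = sum(y * p for y, p in zip(y_true, y_pred))
    fp = sum((1 - y) * p for y, p in zip(y_true, y_pred))
    fn = sum(y * (1 - p) for y, p in zip(y_true, y_pred))
    return 1.0 - (tp + smooth) / (tp + alpha * fp + beta * fn + smooth)


labels = [1, 1, 0, 0]
probs = [0.9, 0.6, 0.2, 0.1]
for loss in (bce_loss, dice_loss, focal_loss, tversky_loss):
    print(loss.__name__, round(loss(labels, probs), 4))
```

With `beta` larger than `alpha`, missed tumor pixels cost more than spurious ones, which is one plausible reason a Tversky loss can raise sensitivity on small lesions.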
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4 Preliminary results and future work 


We evaluated the segmentation performance of the UNET and ResUNET models.
UNET outperformed ResUNET, achieving a Dice score of 0.823 compared with
0.767 for ResUNET.


Fig. 2 shows a comparison of the results of the two approaches.


input Ground Truth UNET RESUNET 


Fig. 2. Example of the predicted segmentation of the breast tumor for the
UNET and ResUNET models. Note: cyan (TP), red (FP), yellow (FN), background
(TN).


We decided to evaluate the loss functions with the UNET model, since it
outperformed the ResUNET model. Table 1 shows the performance of the model
with each loss function. The Tversky loss with tuned hyperparameters
outperformed the other loss functions in terms of Dice, IoU, and
sensitivity.

Methods                                    | Accuracy | Dice  | IoU (Jaccard) | Sensitivity | Specificity
Cross entropy                              | 0.980    | 0.823 | 0.721         | 0.663       | 0.997
Dice coefficient                           | 0.9976   | 0.849 | 0.738         | 0.806       | 0.997
Tversky (alpha=0.3, beta=0.7, smooth=1e-6) | 0.9967   | 0.861 | 0.755         | 0.859       | 0.9972
Focal loss (alpha=0.25, gamma=2)           | 0.9970   | 0.826 | 0.707         | 0.797       | 0.9962
BCE + Dice (mean of the two)               | 0.9968   | 0.818 | 0.696         | 0.735       | 0.9982
Boundary with Dice                         | 0.9956   | 0.805 | 0.674         | 0.725       | 0.9979


Table 1. Evaluation metrics of the UNET with various loss functions
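For reference, the five metrics reported in Table 1 all derive from the pixel-wise confusion counts (TP, FP, FN, TN). The sketch below shows the standard definitions; it is not the evaluation script used for the table, it assumes flat lists of binary predictions and labels, and the function names are hypothetical.

```python
def confusion_counts(pred, truth):
    """Count (TP, FP, FN, TN) over two flat lists of 0/1 labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    return tp, fp, fn, tn


def _ratio(num, den):
    # Returning 0.0 for an empty denominator is a convention, not a standard.
    return num / den if den else 0.0


def segmentation_metrics(tp, fp, fn, tn):
    """Standard segmentation metrics computed from confusion counts."""
    return {
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
        "dice": _ratio(2 * tp, 2 * tp + fp + fn),
        "iou": _ratio(tp, tp + fp + fn),
        "sensitivity": _ratio(tp, tp + fn),   # true positive rate
        "specificity": _ratio(tn, tn + fp),   # true negative rate
    }


print(segmentation_metrics(*confusion_counts([1, 1, 0, 0], [1, 0, 1, 0])))
```

Because tumors occupy a small fraction of each image, TN dominates the counts, which is why accuracy and specificity stay close to 1 for every loss while Dice, IoU, and sensitivity separate the methods.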


[Figure panels: Input image, Ground truth, Cross entropy, Dice coefficient,
Tversky, Focal loss, BCE with Dice, Boundary with Dice]


Fig. 3. Example of the predicted segmentation of the breast tumor by the
UNET model with various loss functions


In future work, we will apply the FCDensenet model to our dataset and
evaluate it with dual attention, dilated convolution, and multiscale
contextual information.
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1 Introduction 


In 1989, LeCun et al. [7] devised the first Convolutional Neural Network
(CNN), which mimicked the organization of neural cells in the visual cortex
as convolutional filters. This new type of neural network was able to
recognize the 10 digits in hand-written text very accurately. The majority
of existing CNN models deal with the basic Red-Green-Blue (RGB) color values
of the input pixels. Although this is the obvious choice, given that digital
images are usually encoded in RGB, it is curious that very few researchers
have attempted to train their networks on images encoded in other color
spaces such as Hue-Saturation-Lightness (HSL) or CIE-LAB, whose definitions
are widely known and long established in the fields of color perception [2]
and colorimetry [6]. The rationale for trying color spaces other than RGB is
based on evidence that human color vision transforms the initial neural
signals from cones and rods into an opponent color model [5], where several
layers of neurons convert the Short, Medium, and Long wavelength neural
signals, loosely related to blue, green, and red hues, into other neural
signals. With regard to human color perception [2], these opponent signals
are further processed and converted into perceptual color components named
Hue, Saturation, and Lightness. Several computational models convert RGB
into HSL-related components, for example Smith’s HSI [3] and Yagi’s HSV [4].


2 Materials and methods 


To check our hypothesis, we perform image classification experiments on the
CIFAR-10 dataset [1], which consists of 60k labelled 32x32 RGB images
belonging to 10 different classes: airplane, automobile, bird, cat, etc.
These images are taken in natural, uncontrolled lighting environments and
contain only one prominent instance of the object to which the class refers;
the object may be partially occluded or seen from an unusual viewpoint. We
aim to explore a simple CNN able to obtain a reasonable test accuracy (above
80%) in the CIFAR-10 image classification task. We compare its behavior
(accuracy variation, patterns in the first-layer filters, etc.) between the
basic RGB and the CIE-LAB encodings. These experiments were made with a Free
Pascal based neural network API [8].


3 Experiments


As a baseline, we defined a single-branch CNN architecture small enough to
classify the CIFAR-10 dataset with at least 80% test accuracy. This
single-branch architecture is shown in Figure 1.


Fig. 1. Graphical representation of the single-branch baseline CNN architecture. 


One of the purposes of our research is to create an architecture that takes
advantage of separated chromatic and achromatic channels, which are readily
available in color spaces such as CIE-LAB or HSV, as explained in the
introductory section. To this end, we propose to create two separate paths
for the first convolutional layer, one dedicated to each type of pixel
information (achromatic/chromatic), in order to specialize the first-layer
filters of the CNN in the corresponding aspects of the scene (light
variations, object boundaries). We hypothesize that this specialization may
lead to better object identification, as a consequence of a more
object-related representation of the image content. Figure 2 shows the
proposed two-branch architecture, where the top branch processes the single
achromatic channel while the bottom branch processes the two chromatic
channels. For example, we can convert RGB into the CIE-LAB color encoding,
so that the L channel is fed into the top branch while the AB channels are
fed into the bottom branch.
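The input preparation for the two branches can be sketched as follows. This is an illustrative Python sketch, not the Free Pascal code used in our experiments: it assumes 8-bit sRGB pixels and the D65 reference white, and the function names are hypothetical. It converts each pixel to CIE-LAB and separates the achromatic L plane from the chromatic AB plane.

```python
def srgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to CIE-LAB (D65 reference white)."""
    def linearize(c):
        # Undo the sRGB transfer curve to obtain linear light in [0, 1].
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = linearize(r), linearize(g), linearize(b)
    # Linear sRGB to CIE XYZ.
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    def f(t):
        delta = 6.0 / 29.0
        return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3 * delta ** 2) + 4.0 / 29.0

    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return 116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz)


def split_lab_branches(rgb_image):
    """Return (L plane, AB plane) for an image given as rows of (r, g, b)."""
    l_plane, ab_plane = [], []
    for row in rgb_image:
        lab_row = [srgb_to_lab(*pixel) for pixel in row]
        l_plane.append([lab[0] for lab in lab_row])              # achromatic branch
        ab_plane.append([(lab[1], lab[2]) for lab in lab_row])   # chromatic branch
    return l_plane, ab_plane


image = [[(255, 255, 255), (255, 0, 0)],
         [(128, 128, 128), (0, 0, 0)]]
l_plane, ab_plane = split_lab_branches(image)
print(l_plane)
print(ab_plane)
```

For gray pixels the A and B components are approximately zero, so all of their information travels through the achromatic branch; this is the decorrelation that the two-branch design tries to exploit.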
Fig. 2. Graphical representation of the proposed two-branch CNN architecture.


4 Results 


As shown in Table 1, our baseline RGB model obtained 84.4% accuracy with
16.5 million floating point operations (flops) on the forward pass, while
our two-paths model obtained 84.7% accuracy with 11.7 million flops, a
reduction of about 29% in the required forward pass computation
((16.5 - 11.7) / 16.5 ≈ 0.29). Figure 3 shows the patterns learned in the L
(achromatic) and AB (chromatic) paths.


model     | color space | accuracy | million flops
baseline  | RGB         | 84.4%    | 16.5
two-paths | LAB         | 84.7%    | 11.7


Table 1. RGB baseline and LAB two-paths results.


5 Conclusions 


By splitting the LAB filter values into two branches, one for L and another
for AB, we can force a CNN to find prototypical sets of achromatic and
chromatic filters, allowing the CNN to achieve similar accuracy while
decreasing the required computation. In essence, we have devised a
modification of the first layer of a CNN into two branches, which reduces
the number of weights when dealing with a color encoding that separates
achromatic from chromatic channels, such as LAB or HSL. Although the
proposed architecture does not increase the validation accuracy
significantly, it suggests that decorrelating the input features can ease
the learning task of a CNN. As future work in this line, we aim to find
other “correlations” in mid-level or high-level layers, so that we may be
able to specialize the network neurons in different types of information.
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Fig. 3. Learned patterns in the L and the AB paths. 
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1 Introduction 


Breast cancer is one of the most common malignancies in women worldwide and
a leading cause of death [1]. Early diagnosis has repeatedly been shown to
reduce overall disease burden and mortality and to improve the chances of
successful treatment. The classical imaging tools used for breast cancer
screening are mammography (X-ray images of the breast) and breast
ultrasonography (BUS). However, because these are 2-D imaging modalities,
high breast density (dense fibroglandular tissue in the breast) limits the
sensitivity and specificity of breast lesion detection [2].

Digital breast tomosynthesis (DBT), a new 3-D breast cancer screening
technique, can address the limitation of tissue overlapping and
superimposition in mammography [3] by providing superior tissue
visualization, which yields a higher breast lesion detection rate. However,
as the number of slices to evaluate grows, so does the risk that physicians
overlook findings, which creates clinical workflow challenges because
radiologists must examine a greater number of slices per breast volume. As a
result, computer-aided detection (CAD) is regarded as the ideal solution for
clinical DBT and plays a greater clinical role in improving work performance
than in traditional digital mammography. Furthermore, due to the higher mass
margin visibility in DBT images, CAD will probably perform better on them
than on mammographic images [4].

Although many automated approaches for accurately detecting breast lesions
in mammographic images have been proposed in the literature, the lack of
sufficient annotated DBT images has held back the number of detection
methods presented for DBT, and breast cancer detection in mammographic and
DBT images is still a challenging task. In this work, we present an
automated deep learning-based breast lesion detection method for DBT images.
It is based on investigating the impact of two data augmentation techniques,
called channel-replication and channel-concatenation, on the breast lesion
detection results of robust object detection models such as YOLO [5] and
Faster R-CNN [6].


2 Deep Learning Based Breast Cancer Detection System 


Deep learning is a branch of machine learning that has revolutionized the
area of computer vision and has been employed in various medical detection
applications, including breast cancer detection. The key elements of our
proposed method are data augmentation, a deep learning-based detector, and
non-maximum suppression (NMS).


2.1 Data augmentation 


In deep learning, data augmentation techniques apply different image
manipulation algorithms to increase the number of training images. In this
work, we analyze the effect of two different data augmentation techniques.


• Channel-replication. In this practice, the N training images are
increased to 6N. First, all images in the training set are flipped
horizontally. Then, gamma correction is applied to each image $I_o$
(original and flipped) following Equation 1 to adjust the overall brightness
of the image and generate $I_g$. In addition, $I_{clahe}$ images are
generated by applying contrast limited adaptive histogram equalization
(CLAHE) to enhance the local contrast of the image [7]. To calculate the
clip limit for the CLAHE algorithm, we follow Equation 2.


$$I_g = I_o^{\gamma} \qquad (1)$$


where $I_g$ is the output image of gamma correction and $\gamma$ is the
gamma correction factor.


$$\mathrm{cliplimit} = \frac{W \times H}{L}\left(1 + \frac{\alpha}{100}\,(S_{max} - 1)\right) \qquad (2)$$


where $W \times H$ is the number of pixels in each region over which a
histogram is calculated, $L$ is the number of gray levels, $\alpha$ is a
clip factor, and $S_{max}$ is the maximum allowable slope.

• Channel-concatenation. In this practice, unlike traditional augmentation
techniques, the number of images is not increased. Instead, a new 3-channel
training image ($I$) is produced by concatenating the original image with
two post-processed images, as shown in Equation 3, following the idea in
[8]. The two filtered images ($I_g$ with $\gamma = 0.5$ and $I_{clahe}$ with
$\alpha = 1$) are concatenated with the original gray-scale image $I_o$.


$$I = \mathrm{Concat}(I_o, I_g, I_{clahe}) \qquad (3)$$


Here, $I$, $I_o$, $I_g$, and $I_{clahe}$ are the output image, the original
image, the image after gamma correction, and the image after CLAHE
equalization, respectively.
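The channel-concatenation step can be sketched as follows. This is a minimal illustration under stated assumptions, not our implementation: gamma correction is assumed to be the power law out = 255 · (in / 255)^γ on 8-bit gray images, the clip limit uses the standard CLAHE expression with the symbols defined above, and the CLAHE-enhanced image is assumed to come from an existing implementation (for example OpenCV's `cv2.createCLAHE`), so the example passes a plain copy as a stand-in. All function names are hypothetical.

```python
def gamma_correct(image, gamma):
    """Power-law gamma correction of an 8-bit gray image (rows of ints)."""
    return [[round(255 * (v / 255) ** gamma) for v in row] for row in image]


def clahe_clip_limit(width, height, gray_levels, alpha, s_max):
    """Clip limit (W*H / L) * (1 + alpha/100 * (S_max - 1)) for one region."""
    return (width * height / gray_levels) * (1 + (alpha / 100) * (s_max - 1))


def concat_channels(original, gamma_img, clahe_img):
    """Stack three gray images into one 3-channel image, pixel by pixel."""
    return [
        [(o, g, c) for o, g, c in zip(row_o, row_g, row_c)]
        for row_o, row_g, row_c in zip(original, gamma_img, clahe_img)
    ]


original = [[0, 64, 255],
            [16, 128, 200]]
gamma_img = gamma_correct(original, gamma=0.5)    # gamma = 0.5 as in the text
clahe_stand_in = [row[:] for row in original]     # placeholder, NOT real CLAHE
three_channel = concat_channels(original, gamma_img, clahe_stand_in)
print(three_channel[0])
```

With γ below 1 this power law brightens dark regions, so the second channel emphasizes low-intensity tissue while the first channel keeps the unmodified intensities.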


2.2 Deep learning based competent detection models 


We used two widely known and efficient deep learning-based object detectors,
YOLO [9] and Faster R-CNN [6], to develop the individual deep learning-based
detection models.

In this work, we employed YOLO version 5, which was the most advanced object
detection algorithm of the YOLO family available at the time of writing. It
detects objects in real time with high accuracy. It uses a single
convolutional neural network (CNN) to process the entire image, divides the
image into parts, and predicts bounding boxes and probabilities for each
part. YOLOv5 is available in four models, namely YOLO-Small (S), YOLO-Medium
(M), YOLO-Large (L), and YOLO-XLarge (XL).

In addition, the Faster R-CNN detector is employed to further validate the
proposed approach. Faster R-CNN is the most widely used state-of-the-art
version of the R-CNN family. It comprises four major parts: 1) a feature
extractor stage, usually a CNN; 2) a region proposal network (RPN), which
uses a CNN instead of a selective search algorithm to predict bounding boxes
of possible objects in the image with a confidence score, which shortens
training time and improves feature representation; 3) a classification layer
that predicts which class each object belongs to; and 4) a regression layer
that makes the coordinates of the object bounding box more precise.


2.3 Implementation 


First, the DBT image dataset was divided patient-wise into training and
testing sets. During the training phase, we use the data augmentation
techniques described in Section 2.1 to generate two training sets (one by
channel-replication augmentation and one by channel-concatenation
augmentation). Then, we train each of the detectors mentioned in Section 2.2
individually on each of these training sets.

Second, during the testing phase, the trained models are used to predict
bounding boxes for each DBT image in the test set. A single list containing
all bounding boxes predicted for a DBT image is passed to the NMS algorithm,
which selects the best bounding box from each set of overlapping or
duplicated boxes.
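The NMS step can be sketched with the usual greedy procedure: boxes are visited in order of decreasing confidence, and a box is kept only if it does not overlap an already kept box by more than an IoU threshold. The sketch below is illustrative rather than the exact routine used in our pipeline; the `(x1, y1, x2, y2)` box format and the 0.5 threshold are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept by greedy NMS, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(non_maximum_suppression(boxes, scores))  # prints [0, 2]
```

In the example, the second box overlaps the first with an IoU of about 0.82 and is suppressed, while the distant third box is retained.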
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2.4 Experimental results 


Table 1 presents a quantitative comparison between the four YOLOv5 models
(YOLO-S, YOLO-M, YOLO-L, and YOLO-XL) and the Faster R-CNN model, trained on
both training datasets (produced by channel-replication augmentation and by
channel-concatenation augmentation), in terms of true positive rate (TPR),
F1-score, and mean average precision (mAP, IoU threshold = 0.5).


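Table 1: The performance of the deep learning detection methods [10]


Augmentation          | Model        | TPR  | F1-score | mAP [IoU = 0.5]
Channel-replication   | YOLOv5-S     | 38.8 | 48.5     | 31.8
Channel-replication   | YOLOv5-M     | 31.8 | 45.7     | 34.1
Channel-replication   | YOLOv5-L     | 24.2 | 36.1     | 26.2
Channel-replication   | YOLOv5-XL    | 22.7 | 35.7     | 26.7
Channel-replication   | Faster R-CNN | 50   | 54.1     | 45.1
Channel-concatenation | YOLOv5-S     | 47   | 52.5     | 48.7
Channel-concatenation | YOLOv5-M     | 39.4 | 51.7     | 38.9
Channel-concatenation | YOLOv5-L     | 39.4 | 56.6     | 40.4
Channel-concatenation | YOLOv5-XL    | 47   | 51.4     | 41.8
Channel-concatenation | Faster R-CNN | 56.1 | 57.4     | 46.8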


As can be seen, with channel-replication YOLO-S achieved the best TPR and
F1-score among the YOLO models, while Faster R-CNN surpassed all YOLO models
on all measures, suggesting that it could be more suitable for DBT images.
Notably, training the deep learning detectors with channel-concatenation
yields improvements on all metrics [10]: the mAP of the YOLO-S model
increased by about 17 points (from 31.8 to 48.7), and the TPR and F1-score
of Faster R-CNN improved by about 6 and 3.3 points, respectively. On the
basis of this analysis, we conclude that the channel-concatenation data
augmentation technique can significantly improve the breast lesion detection
results of deep learning-based detectors such as the YOLO models and Faster
R-CNN.


3 Conclusions 


In this work, we present the strength of two data augmentation strategies
(channel-replication and channel-concatenation) for building
state-of-the-art deep learning-based breast lesion detection models for
digital breast tomosynthesis.

The study demonstrates, on a publicly available digital breast tomosynthesis
dataset, that applying the channel-concatenation data augmentation strategy
helps improve the detection accuracy of all deep learning models. Future
work will focus on the development of a lesion detection approach based on
the combination of robust deep learning-based detectors.


Acknowledgement. The Spanish Government partly supported this research through 
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1 Introduction 


Road damage detection is one of the most important safety issues, as it
directly affects human life and vehicles. Much of the basic infrastructure
of most countries dates back several decades. For example, Japan built
roads, bridges, and other infrastructure extensively during the economic
boom of the late 20th century [1]. As a result, much of this infrastructure
is now more than 50 years old and needs to be inspected and properly
maintained.

Road inspection and maintenance is time-consuming and costly, since this
infrastructure extends for thousands of kilometers, and detecting damaged
parts by traditional methods requires advanced survey equipment, large
financial resources, and experts. For this reason, most municipalities
neglect the detection procedures [2]. The problem of aging infrastructure is
also prevalent in other countries, such as the United States of America [3],
where it is considered a vexing problem for municipalities. Efficient and
advanced ways to maintain infrastructure have therefore become an urgent
necessity.

Recently, several methods and studies have addressed this problem, using
laser technology, image processing, and neural networks and other machine
learning techniques. In 2018, the Road Damage Dataset 2018 [4] was
published, and a challenge based on this dataset was held in Seattle, USA. A
total of 59 teams from 14 countries participated in this competition, and
all the top results used an ensemble of multiple neural networks.

Our work aims to detect and classify road damage in order to help road
managers decide on the proper maintenance for each damage type. To do this,
we use YOLOv5 in our experiments, due to its robustness and promising
results, together with the Road Damage Dataset 2018.
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2 Methodology 


2.1 Dataset 


The dataset consists of 9,053 labeled road damage images with a resolution
of 600x600 pixels, taken in different cities in Japan (Adachi, Chiba,
Ichihara, Muroran, Nagakute, Numazu, and Sumida). The dataset has 8 classes
of road damage: D00, D01, D10, D11, D20, D40, D43, and D44. These classes
are defined in Table 1. The 9,053 images contain 15,435 instances
distributed over the 8 classes, as shown in Figure 1.


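Damage type                         | Detail                             | Class name
Crack | Linear crack | Longitudinal | Wheel mark part                    | D00
Crack | Linear crack | Longitudinal | Construction joint part            | D01
Crack | Linear crack | Lateral      | Equal interval                     | D10
Crack | Linear crack | Lateral      | Construction joint part            | D11
Crack | Alligator crack             | Partial pavement, overall pavement | D20
Other corruption                    | Rutting, bump, pothole, separation | D40
Other corruption                    | Crosswalk blur                     | D43
Other corruption                    | White line blur                    | D44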


Table 1: Source: Road Maintenance and Repair Guidebook 2013 (JRA, 2013) in 
Japan. Note: In reality, rutting, bumps, potholes, and separations are different types 
of road damage, but it is difficult to distinguish these four types using images. There- 
fore, they were classified as one class, namely, D40. 


Fig. 1: The instances of each class in the Road Damage Dataset 2018.


Road Damage Detection Using Yolov5 3 


Model   | Variant | Size   | Speed (ms) | mAP (COCO)
YOLOv5n | Nano    | 4 MB   | 6.3        | 28.4
YOLOv5s | Small   | 14 MB  | 6.4        | 37.2
YOLOv5m | Medium  | 41 MB  | 8.2        | 45.2
YOLOv5l | Large   | 89 MB  | 10.1       | 48.8
YOLOv5x | XLarge  | 166 MB | 12.1       | 50.7


Fig. 2: YOLOv5 Models Comparison.


2.2 Train models using Yolov5 


YOLOv5 offers several models pretrained on the COCO dataset, as shown in
Figure 2. In our experiments, we started the training process from the
YOLOv5x pre-trained model as initial weights with two image sizes (448 and
608) and from the YOLOv5x6 pre-trained model as initial weights with two
image sizes (448 and 576), all with the default hyperparameters.


3 Results 


Figure 3 shows the prediction results compared to the ground truth. The best
single model achieved an F1-score of 0.631 without any improvements such as
test-time augmentation (TTA) or model ensembling. Applying test-time
augmentation and model ensembling improved the predictions in almost all
metrics; Table 2 shows the results in detail.


Yv5x_448 / Yv5x_608 / Yv5x6_576 / TTA | Precision | Recall | mAP@.5 | mAP@.5:.95 | F1-score
X                                     | 0.644     | 0.617  | 0.64   | 0.364      | 0.63
X X                                   | 0.634     | 0.614  | 0.633  | 0.361      | 0.623
X                                     | 0.629     | 0.633  | 0.625  | 0.359      | 0.631
X X                                   | 0.695     | 0.608  | 0.647  | 0.374      | 0.648
X                                     | 0.617     | 0.644  | 0.642  | 0.364      | 0.63
X X                                   | 0.613     | 0.657  | 0.65   | 0.37       | 0.634
X X X                                 | 0.631     | 0.649  | 0.658  | 0.378      | 0.639
X X X                                 | 0.59      | 0.675  | 0.664  | 0.376      | 0.629
X X X                                 | 0.648     | 0.641  | 0.662  | 0.378      | 0.644
X X X X                               | 0.629     | 0.657  | 0.666  | 0.381      | 0.642


Table 2: The results for all trained models
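The F1-score column of Table 2 is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall values as a consistency check; it is not part of our training pipeline, and because the inputs are rounded to three decimals the recomputed values can differ from the reported ones in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1-score), copied from Table 2 in row order.
TABLE2_ROWS = [
    (0.644, 0.617, 0.630),
    (0.634, 0.614, 0.623),
    (0.629, 0.633, 0.631),
    (0.695, 0.608, 0.648),
    (0.617, 0.644, 0.630),
    (0.613, 0.657, 0.634),
    (0.631, 0.649, 0.639),
    (0.590, 0.675, 0.629),
    (0.648, 0.641, 0.644),
    (0.629, 0.657, 0.642),
]

for precision, recall, reported in TABLE2_ROWS:
    print(f"P={precision:.3f} R={recall:.3f} "
          f"F1={f1_score(precision, recall):.3f} (reported {reported:.3f})")
```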
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(a) Labels of a batch of 16 images (b) Predictions of a batch of 16 images 


Fig. 3: Comparison between the ground truth and the predictions


4 Conclusions 


The results obtained using YOLOv5 show that our approach achieves results
close to the state of the art, with a mAP@.5 of up to 0.666, a mAP@.5:.95 of
up to 0.381, and an F1-score of up to 0.648.
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